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Abstract 

CO 

^ Model comparison for the purposes of selection, averaging and validation is a problem found 

^ throughout statistics and related disciplines. Within the Bayesian paradigm, these problems 

CN all require the calculation of the posterior probabilities of models within a particular class. 

5—1 Substantial progress has been made in recent years, but there are numerous difficulties in 

^ the practical implementation of existing schemes. This paper presents adaptive sequential 

Monte Carlo (SMC) sampling strategies to characterise the posterior distribution of a coUec- 
^ tion of models, as well as the parameters of those models. Both a simple product estimator 

^ and a combination of SMC and a path sampling estimator are considered and existing the- 

oretical results are extended to include the path sampling variant. A novel approach to the 
automatic specification of distributions within SMC algorithms is presented and shown to 
outperform the state of the art in this area. The performance of the proposed strategies is 
demonstrated via an extensive simulation study making use of the Gaussian mixture model 
^ and two challenging realistic examples. Comparisons with state of the art algorithms show 

^ that the proposed algorithms are always competitive, and often substantially superior to al- 

^ c/^^ ternative techniques, at equal computational cost and considerably less application-specific 

implementation effort. 

K*" KejTvords: Adaptive Monte Carlo algorithms; Bayesian model comparison; Normalising 

constants; Path sampling; Thermodynamic integration 

^ 1 Introduction 

Model comparison, selection and averaging lie at the core of Bayesian decision theory 
(Robert, 2007) and have attracted considerable attention in the past few decades. In most 

;v cases, approaches to the calculation of the required posterior model probabilities have re- 

. ^ volved around simple asymptotic arguments or the post-processing of outputs from Markov 

chain Monte Carlo (MCMC) algorithms operating on the space of a single model or using 

H specially designed MCMC techniques that provide direct estimates of these quantities (for 

example Reversible Jump MCMC, RJMCMC; Green (1995)). Within-model simulations are 
typically somewhat simpler, but generalisations of the harmonic mean estimator (Gelfand 
and Dey, 1994) which are typically used in this setting require careful design to ensure 
finite variances and, convergence assessment can be rather difficult. Simulations on the 
whole model spaces are often difficult to implement efficiently even though they can be 
conceptually appealing. 

More robust and efficient Monte Carlo algorithms have been established in recent years. 
Many of them are population based, in that they deal explicitly with a collection of sam- 
ples at each iteration, including sequential importance sampling and resampling (AIS, Neal 
(2001); SMC, (Del Moral et al., 2006b)) and population MCMC (PMCMC; Liang and Wong 
(2001); Jasra et al. (2007a)). However, most studies have focused on their abilities to 
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explore high dimensional and multimodal spaces. Results on the effectiveness of these algo- 
rithms when applied to Bayesian model comparison problems are less well studied. In the 
present work, we motivate and present a number of approaches based around the sequen- 
tial Monte Carlo family of algorithms, and demonstrate the effectiveness of the proposed 
strategy empirically. 

Sequential Monte Carlo (SMC) methods are a class of sampling algorithms which com- 
bine importance sampling and resampling. They have been primarily used as "particle fil- 
ters" to solve optimal filtering problems; see, for example, Cappe et al. (2007); Doucet and 
Johansen (201 1) for recent reviews. They are used here in a different manner, that proposed 
by Del Moral et al. (2006b) and developed by Del Moral et al. (2006a); Peters (2005). This 
framework involves the construction of a sequence of artificial distributions on spaces of 
increasing dimensions which admit the distributions of interest as particular marginals. 

Although it is well known that SMC is well suited for the computation of normalising 
constants and that it is possible to develop relatively automatic SMC algorithms by em- 
ploying a variety of "adaptive" strategies, their use for Bayesian model comparison has not 
yet received a great deal of attention. We highlight three strategies for computing poste- 
rior model probabilities using SMC, focusing on strategies which require minimal tuning 
and can be readily implemented requiring only the availability of locally-mixing MCMC pro- 
posals. These methods admit natural and scalable parallelisation and we demonstrate the 
potential of these algorithms with real implementations suitable for use on consumer-grade 
parallel computing hardware including GPUs, reinforcing the message of Lee et al. (2010). 
We also present a new approach to adaptation and guidelines on the near-automatic imple- 
mentation of the proposed algorithms. These techniques are applicable to SMC algorithms 
in much greater generality. The proposed approach is compared with state of the art alter- 
natives in extensive simulation studies which demonstrate its performance and robustness. 

In the next section we provide a brief survey of the considerable literature on Monte 
Carlo methods for Bayesian model comparison. Section 3 presents three algorithms for 
performing model comparison using SMC techniques and Section 4 provides several illus- 
trative applications, together with comparisons with other techniques. Section 5 extends 
the standard SMC central limit theorem to include the path sampling estimator. The paper 
concludes with some discussions in Section 6. 

2 Background 

Bayesian model comparison must be based upon the posterior distribution over models. 
It is only possible to obtain closed-form expressions for posterior model probabilities in 
very limited situations. Over the past five decades, this problem has attracted considerable 
attention. It is not feasible to exhaustively summarise this literature here. We aim only to 
describe the major contributions to the area and recent developments which are particularly 
relevant to the present paper. 

2.1 Analytic Methods and MCMC 

The Bayesian Information Criterion (BIC), developed by Schwarz (1978), is based upon a 
large sample approximation of the Bayes factor It is defined as BIC = —2£ + klog{n), where 
£ denotes the maximum of the log-likelihood for the observed data, k the number of model 
parameters and n the effective dimension of the data. An asymptotic argument concerning 
Bayes factors under appropriate regularity conditions justifies the choice of the model with 
the smallest value of BIC. Although appealing in its simplicity, such an approach can only 
be formally justified when a very large number of observations (compared to the number of 
parameters) is available. 

In this era of fast computing, the difficulty of evaluating the integrals that must be com- 
puted in order to adapt a fully Bayesian approach is much smaller than it once was. The 
Bayesian approach to model comparison is, of course, to consider the posterior probabilities 
of the possible models (Bernardo and Smith, 1994, Chapter 6). Within this Bayesian frame- 
work the decision making process, which might include model comparison, model selection 
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or the choice of an action, depend upon the relative probabihties of several models (Robert, 
2007, Chapter 7). 

Given an (at most countable) collection of models {Mk}keic, with model Mk having 
parameter space 6^, Bayesian inference proceeds from a prior distribution over the collec- 
tion of models, 7r(Mfc), a prior distribution for the parameters of each model, Tr{Ok\Mk) and 
the likelihood under each model piy\9k, Mk). In order to perform model comparison, one 
requires the posterior model probability, 

.(M.I,) = «^ (1) 

where p{y\Mk) = Jg^ p{y\9k, Mk)'n'i0k\Mk) d 9k is termed the evidence for model Mk and the 
normalising constant = J2keKP(y\^^k)'^{Mk) can be easily calculated if |/C| is finite and 
the evidence for each model is available. The case where |/C| is countable is discussed later. 
We first review some techniques for calculating the evidence for each model individually. 

Several techniques have been proposed to approximate the evidence for a model using 
simulation techniques which approximate the posterior distribution of that model, including 
the harmonic mean estimator of Newton and Raftery (1994); Raftery et al. (2006) and the 
generalisations due to Gelfand and Dey (1994). These pseudo-harmonic mean methods are 
based around the insight that for any density g, such that g and the posterior density are 
mutually absolutely continuous, the fact that following identity holds, 

f 9{9k) , M\AO - f p{y,9k\Mk) _ 1 

and by approximating the leftmost integral using any numerical integration technique one 
can in principle obtain an estimate of the evidence. Unfortunately, considerable care is 
required in the implementation of such schemes in order to control the variance of the 
resulting estimator (and indeed, to ensure that this variance is finite; see for example Neal 
(1994)). 

In the particular case of the Gibbs sampler, Chib (1995) provides an alternative approach 
to the approximation of the evidence from within-model simulations based on that the iden- 
tity, 

P(y|M.) = ?(*;i«f^, (3) 

■K{()k\y,Mk) 

holds for any value of 9k- Therefore, an estimator of the marginal likelihood can be obtained 
by replacing 9k with a particular value, say 9^^, which is usually chosen from the high prob- 
ability region of the posterior distribution and approximating the denominator 7r((?^|y, Mk) 
using the output from a Gibbs sampler. Though this method does not suffer the instability 
associated with generalised harmonic mean estimators, it requires that all full conditional 
densities are known including their normalising constants. This approach was generalised 
to other Metropolis-Hastings algorithms by Chib and Jeliazkov (2001), where only the pro- 
posal distributions are required to be known including their normalising constants. 

The first MCMC method which operated simultaneously over the full collection of mod- 
els of interest providing direct estimates of posterior model probabilities was probably the 
approach of Grenander and Miller (1994). However, the general Reversible Jump MCMC 
(RJMCMC) strategy first proposed by Green (1995) is undoubtedly the most widespread 
of these techniques. RJMCMC adapts the Metropolis-Hastings algorithm to construct a 
Markov chain on an extended state-space which admits the posterior distribution over both 
model and parameters as its invariant distribution. The algorithm operates on the space 
UfceA:({Mfc} X 6fe). A countable set of types of moves are considered, say m e M, and each 
move type m is capable of moving between two models, say Mk and Mk' (where k — k' 
in the case of within-model moves). At state 6k, a move tj^e m together with a new state 
dk' are proposed according to q,n{9k,9k')rm{9k), where rm{9k) is the probability of choosing 
type m move when at state dk and qm{9k, 6k') is the proposal kernel for the new state when 
move type m is made. The move is accepted with probability 

... X . TTiMk')7r{9k'\Mk')piy\9k',Mk')qm{9k',9k)r^{9'^) \ 
am[k,k') mm| , ^^Mk)7ri6k\Mk)piy\6k, Mk) qmi9k,9k')rmi9k) }' 
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In practice, sampling of the proposed new state 9k' is often achieved by drawing a vector 
of continuous random variables, say u, which are independent of 9k and applying a deter- 
ministic mapping of vector {9k, u) to 9k'. The inverse of the move, from 9k' back to 9k, then 
uses the inverse of this transformation. Through a simple change of variable, the condi- 
tional density g„ {9k , 9k' ) can be expressed in terms of the density of vector u, q{u), and the 
acceptance probability becomes 



.... . ,, 7r{Mk'M9k'\Mk')p{y\9k',Mk') rm{9k') 1 
am{t^k,t^k')-mm<^i, ^^MkM9k\Mk)p{y\9k,Mk) Trn{9k) q{u) 



d9k' 
d{9k,u) 



(5) 



where the last term is the determinant of the Jacobian of the transformation. The design 
of efficient between-model moves is often difficult, and the mixing of these moves largely 
determines the performance of the algorithm. For example, in multimodal models, where 
RJMCMC has attracted substantial attention, information available in the posterior distribu- 
tion of a model of any given dimension does not characterise modes that exist only in models 
of higher dimension, and thus successful moves between those models become unlikely and 
difficult to construct (Jasra et al., 2007b). In addition, RJMCMC will not characterise mod- 
els of low posterior probability well, as those models will be visited by the chain only rarely. 
In some cases it will be difficult to determine whether the low acceptance rates of between- 
model moves result from actual characteristics of the posterior or from a poorly-adapted 
proposal kernel. 

The related continuous time birth and death algorithm of Stephens (2000) was shown 
by Cappe et al. (2003) to have no qualitative advantage over the simpler discrete time for- 
mulation. A post-processing approach to improve the computation of normalising constants 
from RJMCMC output using a bridge-sampling approach was advocated by Bartolucci et al. 
(2006). Sophisticated variants of these algorithms, such as those developed in Peters et al. 
(2010), have been considered in recent years but depend upon essentially the same con- 
struction and ultimately require adequate mixing of the underlying Markov process. 

Carlin and Chib (1995) presented an alternative method for simulating the model prob- 
ability directly through a Gibbs sampler on the space {Mk}keic x TlfceK; '^^^ joint P^" 
rameter is thus (M, 9) where 9 is the vector {9k)keK. and conditional on model Mk the data 
y only depends on a subset, 9k, of the parameters. To form the Gibbs sampler, a so called 
pseudoprior Tr{9k\M =/= Mk) in addition to the usual prior 'K{9k\Mk) is selected, such that 
given the model indicator M, the parameters associated with different models are condi- 
tionally mutually independent. In this way, a Gibbs sampler can be constructed provided 
that all the full conditional distributions ■n{9k\y,9k'^k,M) and 7r(M = Mk\y,9) ior k e K. 
are available. The major drawback of this approach is that the performance and validity of 
the sampler is very sensitive to the selected pseudopriors, and as usual for all Gibbs sam- 
plers, the full conditional distribution needs to be readily sampled from. This approach was 
later generalised by Godsill (2001) who also explored the connection with RJMCMC. 

Overall, the methods reviewed above either demand some knowledge of the target dis- 
tributions that is often missing in reality, or require substantial tuning in order for the algo- 
rithms to perform well. 



2.2 Recent Developments on Population-Based Methods 

In the recent computational statistics literature, there has been a tendency to consider the 
use of population-based sampling methods. There are two popular approaches among many 
others. One is based on sequential importance sampling and resampling, such as the SMC 
sampler (Del Moral et al., 2006b) and earlier development of annealed importance sam- 
pling (AIS; Neal (2001)) which can be viewed as a special case of SMC. Another approach 
is population MCMC (PMCMC; Marinari and Parisi (1992); Geyer (1991); Liang and Wong 
(2001)) also known as parallel tempering, which uses a collection of parallel MCMC chains 
to approximate a target distribution. PMCMC operates by constructing a sequence of dis- 
tributions {TTtj^o with TTo corresponding to the target distribution and successive elements 
of this sequence consisting of distributions from which it is increasingly easy to sample. A 
population of samples is maintained, with the element of the population being approx- 
imately distributed according to tt^; the algorithm proceeds by simulating an ensemble of 
parallel MCMC chains each targeting one of these distributions. The chains interact with 
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one another via exchange moves, in which the state of two adjacent chains is swapped, and 
this mechanism allows for information to be propagated between the chains and hopefully 
for the fast mixing of ttt to be partially transferred to the chain associated with ttq. The out- 
puts are samples that approximate the product H^o ""t which admits the target distribution 
as its first coordinate marginal. 

There is evidence in the literature of substantial interest in the potential of using popu- 
lation based methods to explore high dimensional and multimodal parameter spaces which 
were previously difficult for conventional MCMC algorithms. Jasra et al. (2007a) compared 
the performance of the two approaches in this context. There are also increasing interest 
of using these methods for Bayesian model comparison. The PMCMC outputs can be post- 
processed in the same way as conventional MCMC to obtain estimates of evidence for each 
model (for example using a generalised harmonic mean estimator) . However, this approach 
inherits many of the disadvantages of this estimator. Jasra et al. (2007b) combined PM- 
CMC with RJMCMC and thus provide a direct estimate of the posterior model probability. 
Another approach is to use the outputs from all the chains to approximate the path sam- 
pling estimator (Gelman and Meng, 1998), see Calderhead and Girolami (2009). However, 
the mixing speed of PMCMC is sensitive to the number and placement of the distributions 
{TTtj^Lo (see Atchade et al. (2010) for the optimal placement of distributions in terms of 
a particular mixing criterion for a restricted class of models) . As seen in Calderhead and 
Girolami (2009), the placement of distributions can play a crucial role in the performance 
of the estimator, a topic we will revisit in Section 4. 

The use of AIS for computing normalising constants directly and via path sampling dates 
back at least to Neal (2001); see Vyshemirsky and Girolami (2008) for a recent example 
of its use in the computation of model evidences. In the literature it has generally been 
suggested that more general SMC strategies provide no advantage over AIS when the nor- 
malizing constant is the object of inference. Later we will demonstrate that this is not 
generally true, adding improved robustness of normalizing constant estimates to the advan- 
tages afforded by resampling within SMC. We will also discuss more details on the use of 
SMC and path sampling for Bayesian model section in the next section. The use of PMCMC 
coupled with path sampling was discussed in Vyshemirsky and Girolami (2008). 

Jasra et al. (2008) developed a method using a system of interacting SMC samplers for 
trans-dimensional simulation. The targeting distribution tt and its space S are the same as in 
RJMCMC. As usual in SMC, a sequence of distributions {^t}?Lo with increasing dimensions 
are constructed such that ttt admits tt as a marginal. The algorithm starts with a set of SMC 
samplers with equal number of particles; each of them targets Wi^tix) oc TTt{x)l{x e Si^t) 
up to a predefined time index t*, such that {Sifi} is a partition of S and Si^t* = S. At 
time t* particles from all samplers are allowed to coalesce, and from this time on, all of 
them are iterated with the same Markov kernel until the single sampler reaches the target 
TT. Each individual sampler explores only a portion of the parameter space and by using 
the information which each sampler gains about that region of the parameter space, with a 
properly chosen t*, the resulting sampler will be able to explore the whole space efficiently. 
One of the three algorithms detailed in the next section coincides, essentially, with the final 
stage of the approach of Jasra et al. (2008); the other algorithms which are developed rely 
on a quite different strategy. 

A proof of concept study in which several SMC approaches to the problem were oudined 
was provided by Zhou et al. (2012) and these approaches are developed below. These strate- 
gies based around various combinations of path sampling (Gelman and Meng, 1998) and 
SMC (as used by Johansen et al. (2006) in a rare events context and by Rousset and Stoltz 
(2006) in the context of the estimation of free energy differences) or the unbiased estima- 
tion of the normalizing constant via standard SMC techniques (Del Moral, 1996; Del Moral 
etal., 2006b). 

A number of other recent developments should be mentioned for completeness, but are 
not directly relevant to the problems considered here. A strategy for SMC-based variable 
selection was developed by Schafer and Chopin (2013); this approach depends upon the 
precise structure of this particular problem and does not involve the explicit computation of 
normalizing constants. 

In recent years, several approaches to the problem of Bayesian model comparison in 
settings in which the likelihood cannot be evaluated have also been proposed: Grelaud 
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et al. (2009); Didelot et al. (2011); Robert et al. (2011). This class of problems falls outside 
the scope of the current paper. We will assume throughout that the likelihood of all models 
can be evaluated point wise. 

2.3 Challenges for Model Comparison Techniques 

We conclude this background section by noting that there are a number of desirable fea- 
tures in algorithms which seek to address any model comparison problem and that these 
desiderata can find themselves in competition with one another. One always requires ac- 
curate evaluation of Bayes factors or model proportions and to obtain these one requires 
estimates of either normalizing constants or posterior model probabilities with small error 
making the efficiency of any Monte Carlo algorithm employed in their estimation critical. If 
one is interested in characterising behaviour conditional upon a given model or even calcu- 
lating posterior-predictive quantities, it is likely to be necessary to explore the full parameter 
space of each model; this can be difficult if one employs between-model strategies which 
spend little time in models of low probability. In many settings end-users seek to interpret 
the findings of model selection experiments and in such cases, accurate characterisation of 
all models including those of relatively small probability may be important. 



3 Methodology 

SMC samplers allow us to obtain, iteratively, collections of weighted samples from a se- 
quence of distributions {irt}f^Q over essentially any random variables on some measurable 
spaces {Et,£t), by constructing a sequence of auxiliary distributions {7rt}^o spaces of 
increasing dimensions, 

t-i 

n(.XO:t) = M^t) Yl Ls{Xs+l, Xs), (6) 
s=0 

where the sequence of Markov kernels {Ls}*~g, termed backward kernels, is formally arbi- 
trary but critically influences the estimator variance. See Del Moral et al. (2006b) for further 
details and guidance on the selection of these kernels. 

Standard sequential importance resampling algorithms can then be applied to the se- 
quence of synthetic distributions, {TTtjJ^Q. At time t = n — 1, assume that a set of weighted 
particles {W'^'iij-^o^i-ili^i approximating 7f„_i is available, then at time t = n, the path 
of each particle is extended with a Markov kernel say, i4r„(x„_i, a;„) and the set of particles 

{^ol}iLi reach the distribution r]n{xo;„) = rio{xo) JltLi Kt{xt-i,xt), where % is the ini- 
tial distribution of the particles. To correct the discrepancy between rjn and importance 
sampling is then applied, that is the weights are corrected by 

W I \ '^n{xO:n) irn{Xn)]XlZl L ^{x s+l, X s) . . ^ 

Wn{XO:n) OC = ' CX Wn-l{XO:n-l)Wn{Xn-l,Xn) (7) 

"nnyXQ-.n) VO\Xo) llf^i J^t\Xt-l,Xt) 

where Wn, termed the incremental weights, are calculated as, 

— / \ ^n('^n)-^n— 1 (-^n; •^n— l) 

Wn{Xn-\,Xn) = yT^, ^ (8) 

li-Kn is only known up to a normalizing constant, say 7r„(a:^) = 7„(x^)/Z^, then we can use 
the unnormalised incremental weights 

Wn{Xn-l,Xn) = 7 rr^, (9) 

for importance sampling. Further, with the previously normalised weights {VF^'i^lili, we 
can estimate the ratio of normalizing constant Zn/Zn-i by 

^ =Y^wi'i,wn{x^:Uj, (10) 
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Table 1: Strengths of computational strategies for model choice. PMCMC admits parallelisation 
up to the number of chains used, but is not a natural candidate for implementation on massively- 
parallel architectures. 



and 

|=ii#T=riE<w^^i:.), (11) 

^ p=2 P~'- p=2 i=l 

provides an unbiased (Del Moral, 2004, Proposition 7.4.1) estimate of Sequentially, 
the normalizing constant between initial distribution ttq and target ttt can be estimated. 
See Del Moral et al. (2006b) for details on calculating the incremental weights in general; 
in practice, when Kn is 7r„ -invariant, 7r„ <C 7r„_i, and 

n-l{x„,Xn-l) = ^ , \ (12) 

\Xn ) 

is used as the backward kernel, the unnormalised incremental weights become 

Wn{x„-l,Xn) = 7 r. (13) 

7„_i(a;„_i) 

This will be the situation throughout the remainder of this paper. 



3.1 Sequential Monte Carlo for Model Comparison 

The problem of interest is characterising the posterior distribution over {Mk}keic, a set of 
possible models, with model Mk having parameter vector 9^ G 6^ which must also usu- 
ally be inferred. Given prior distributions n{Mk) and TT{6k\Mk) and likelihood p{y\dk,Mk) 
we seek the posterior distributions Tr{Mk\y) a p{y\Mk). There are three fundamentally 
different approaches to the computations: 

1. Calculate posterior model probabilities directly. 

2. Calculate the evidence, p{y\Mk), of each model. 

3. Calculate pairwise evidence ratios. 

Each approach admits a natural SMC strategy. The relative strengths of these approaches, 
which are introduced in the following sections, and alternative methods are identified in 
Table 1. 



3.1.1 SMC 1 : An All-in-One Approach 

One could consider obtaining samples from the same distribution employed in the RJMCMC 
approach to model comparison, namely: 

Tr^^\Mk,ek) oc n{Mk)n{ek\Mk)p{y\ek,Mk) (14) 
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which is defined on the disjoint union space \Jk^ici{Mk} x ©fe)- 

One obvious SMC approach is to define a sequence of distributions {ttJ^^I^q such that 
ttq^^ is easy to sample from, ir^^ = n^^^ and the intermediate distributions move smoothly 
between them. In the remainder of this section, we use the notation {Mt, Ot) to denote a 
random sample on the space UjteK;({^'^fe} ^ ®*:) ^vcae t. One simple approach, which 
might be expected to work well, is the use of an annealing scheme such that: 

4'\Mt,9t) a 7riMt)7T{et\Mt)p{y\et,Mtr^*/^\ (15) 

for some monotonically increasing a : [0, 1] [0, 1] such that a(0) = and a{l) — 1. Other 
approaches are possible and might prove more efficient for some problems (such as the "data 
tempering" approach which Chopin (2002) proposed for parameter estimation which could 
easily be incorporated in our framework), but this strategy provides a convenient generic 
approach. These choices lead to Algorithm 1. 

This approach might outperform FUMCMC when it is difficult to design fast-mixing 
Markov kernels. There are many examples of such an annealed SMC strategy outperform- 
ing MCMC at a given computational cost — see, for example. Fan et al. (2008) ; Johansen 
et al. (2008); Fearnhead and Taylor (2010). Such trans-dimensional SMC has been pro- 
posed in several contexts (Peters, 2005) and an extension proposed and analysed by Jasra 
etal. (2008). 



Algorithm 1 SMCl: An All-in-One Approach to Model Comparison. 

Initialisation: Set t ^ 0. 

Sample X^*^ = {M^'\e^^^) ~ for some proposal distribution i/ (usually the joint prior). 
Weight oc «;o(4'^) = 7r(M«)7r(^«|M«)/KM«, 

Apply resampling if necessary (e.g., if ESS (Kong et al., 1994) less than some threshold). 
Iteration: Set t ^ t + 1. 

Weight Wi'^ oc l^/!\p(2/|02i,M,(i\)"(*/^)-"([*-i]/^). 

Apply resampling if necessary. 

Sample X^^^ ~ Kt{-\xj.'^^), a vr^^'' -invariant kernel. 
Repeat the Iteration step until t = T. 



We include this approach for completeness and study it empirically later However, the 
more direct approaches described in the following sections lead more naturally to easy-to- 
implement strategies with good performance. 

3.1.2 SMC2: A Direct-Evidence-Calculation Approach 

An alternative approach would be to estimate explicitly the evidence associated with each 
model. We propose to do this by sampling from a sequence of distributions for each model: 
starting from the parameter prior and sweeping through a sequence of distributions to the 
posterior 

Numerous strategies are possible to construct such a sequence of distributions, but one 
option is to use for each model Mk, k G K, the sequence {irl^'''^ }^0' defined by 

iri'^'^et) cx 7ri0t\Mk)p{y\0t,Mkr-('/^''l (16) 

where the number of distribution Tk, and the annealing schedule, ak : [0, 1] — >■ [0, 1] maybe 
different for each model. This leads to Algorithm 2. 

The estimator of the posterior model probabilities depends upon the approach taken 
to estimate the normalizing constant. Direct estimation of the evidence can be performed 
using the output of this SMC algorithm and the standard unbiased estimator, termed SMC2- 
DS below: 

E X n E wi^^wief (17) 
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where w/*'/^ is the importance weight of sample i, O^'^i, during iteration t — 1 for model 
Mfe. An alternative approach to computing the evidence is also worthy of consideration. 
As has been suggested, and shown to perform well empirically previously (Johansen et al., 
2006, see, for example), it is possible to use all of the samples from every generation of an 
SMC sampler to approximate the path sampling estimator and hence to obtain an estimate 
of the ratio of normalizing constants. Section 3.2 provides details. 

The posterior distribution of the parameters conditional upon a particular model can 
also be approximated with: 

N 

1=1 

where 5g(k,^) is the Dirac measure. 

This approach is appealing for several reasons. One is that it is designed to estimate 
directly the quantity of interest: the evidence, producing a sample from that distribution at 
the same time. Another advantage of this approach over SMCl and the RJMCMC approach 
is that it provides as good a characterisation of each model as is required: it is possible to 
obtain a good estimate of the parameters of every model, even those for which the poste- 
rior probability is small. Perhaps most significant is the fact that this approach does not 
require the design of proposal distributions or Markov kernels which move from one model 
to another: each model is dealt with in isolation. Whilst this may not be desirable in ev- 
ery situation, there are circumstances in which efficient moves between models are almost 
impossible to devise. 

This approach also has some disadvantages. In particular, it is necessary to run a separate 
simulation for each model — rendering it impossible to deal with countable collections of 
models (although this is not such a substantial problem in many interesting cases) . The ease 
of implementation may often offset this limitation. 



Algorithm 2 SMC2: A Direct-Evidence-Calculation Approach. 
For each model k e IC perform the following algorithm. 
Initialisation: Set t 0. 

Sample ^q'^'*^ ~ Uk for some proposal distribution (usually the parameter prior). 

Weight oc woiet^) = 7r{ei,''''>m)/uk{ei,''% 

Apply resampling if necessary. 
Iteration: Set t t + 1. 

Weight W^f oc w£'/V(y|e}-?,Mjfc)°(*/^*)-°([*-il/^fc). 

Apply resampling if necessary. 

Sample (9f '^^ ~ Kt(-|6'J^f ), a vrf'^^ -invariant kernel. 
Repeat the Iteration step until t = Tk. 



3.1.3 SMC3: A Relative-Evidence-Calculation Approach 

A final approach can be thought of as sequential model comparison. Rather than estimating 
the evidence associated with any particular model, we could estimate pairwise evidence 
ratios direcdy. The SMC sampler starts with a initial distribution being the posterior of one 
model (which could comes from a separate SMC sampler starting from its prior) and moves 
towards the posterior of another related model. Then the sampler can continue towards 
another related model. 

Given a finite collection of models {A/fe}, k € K., suppose the models are ordered in a 
sensible way (e.g., Mk-i is nested within Mk or 0k is of higher dimension than 0k-i)- For 
each fc e /C, we consider a sequence of distributions {ttj^ '"''}^;), such that ■k'^'^\m,6) = 
7r(^|y,Mfc)I{M,}(M) and 4!'''^(^>^) = A6\y,Mk+i)liM,+,}{M) = 4^'''+^\M,e). When 
it is possible to construct a SMC sampler that iterates over this sequence of distributions, the 
estimate of the ratio of normalizing constants is the Bayes factor estimate of model M^+i in 
favour of model M^. 
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This approach is conceptually appealing, but requires the construction of a smooth path 
between the posterior distributions of interest. The geometric annealing strategy which has 
been advocated as a good generic strategy in the previous sections is only appropriate when 
the support of successive distributions is non-increasing. This is unlikely to be the case in 
interesting model comparison problems. 

In this paper we consider a sequence of distributions on the disjoint union {Mk, Qk} U 
{Mfc+i, 0fe+i}, with the sequence of distributions {ttP'^^^I^q defined as the full posterior, 

^^■''HMu Bt) cx 'Kt{Mt)Tv{et\M,)p{y\eu Mt) (18) 

where Mt e {Mf., M^+i} and the prior of models at time t, nt{Mt) is defined by 

MMk+i) = a{t/Tk) (19) 

for some monotonically increasing a : [0, 1] — )■ [0, 1] such that a(0) = and a(l) — 1. It is 
clear that the MCMC moves between iterations need to be similar to those in the RJMCMC 
or SMCl algorithms. The difference is that instead of efficient exploration of the whole 
model space, only moves between two models are required and the sequence of distributions 
employed helps to ensure exploration of both model spaces. The algorithm for this particular 
sequence of distribution is outlined in Algorithm 3. It can be extended to other possible 
sequence of distributions between models. 

An advantage of this approach is that it provides direct estimate of the Bayes factor which 
is of interest for model comparison purpose while not requiring exploration of as compli- 
cated a space as that employed within RJMCMC or SMCl. The estimating of normalizing 
constant in SMC3 can follows in exactly the same manner as in the SMC2 case. In SMC3, 
the same estimator provides a direct estimate of the Bayes factor. 



Algorithm 3 SMC3: A Relative-Evidence-Calculation Approach to Model Comparison. 
Initialisation: Set A; 1. 

Use Algorithm 2 to obtain weighted samples for tt^'^^ the parameter posterior for model 

Ml 

Relative Evidence Calculation 
Set k ^ k + I, t ^ 0. 

Denote current weighted samples as {W^''''\x^''''^}fL^ where X^'''"'> = {M^^'''^ ,6'^^'^) 
Apply resampling if necessary. 
Iteration: Set t ^ t + 1. 

Weight oc 7rt(M^^f )/7^t-i(^t-f )• 

Apply resampling if necessary. 

Sample (Mf ~ i^t(-|Mf f f ), a vrf'^^^ -invariant kernel. 
Repeat the Iteration step up to t = T^. 
Repeat the Relative Evidence Calculation step until sequentially all relative evidences are calcu- 
lated. 



3.2 Path Sampling via SMC2/SMC3 

The estimation of the normalizing constant associated with our sequences of distributions 
can be achieved by a Monte Carlo approximation to the path sampling formulation given 
by Gelman and Meng (1998) (sometimes known as thermodynamic integration or Ogata's 
method). This approach is very closely related to the use of AIS for the same purpose Neal 
(2001) but as will be demonstrated below the incorporation of some other elements of the 
more general SMC algorithm family can improve performance at negligible cost. Given 
a parameter a which defines a family of distributions, {pa = qa/Za}ae[OA] which move 
smoothly from po = qo/Zq to pi = qi/Zi as a increases from zero to one, one can estimate 
the logarithm of the ratio of their normalizing constants via a simple integral relationship 
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which holds under very mild regularity conditions: 



dlogga(-) 

da 



da, (20) 



where denotes expectation under pa', see Gelman and Meng (1998). Note that the 
sequence of distributions in the SMC2 and SMC3 algorithms above, can both be interpreted 
as belonging to such a family of distributions, with at — a{t/Tk), where the mapping a : 
[0, 1] [0, 1] is again monotonic with a(0) = and a(l) = 1. 

The SMC sampler provides us with a set of weighted samples obtained from a sequence 
of distributions suitable for approximating this integral. At each t we can obtain an esti- 
mate of the expectation within the integral for a{t/T) via the usual importance sampling 
estimator, and this integral can then be approximated via numerical integration. Whenever 
the sequence of distributions employed by SMC3 has appropriate differentiability it is also 
possible to employ path sampUng to estimate, direcdy, the evidence ratio via this approach 
applied to the samples generated by that algorithm. In general, given an increasing sequence 
{at}^o where ao = and ar = 1, a family of distributions {pa}ae[o,i\ as before, and a SMC 
sampler that iterates over the sequence of distribution {wt = Pat = I^at }^0' with 
the weighted samples {IF/-'-', Xp-'j^^^, and i = 0, . . . ,T, a path sampling estimator of the 
ratio of normalizing constants St = \og{Zi/Zo) can be approximated (using an elementary 
trapezoidal scheme) by 



§T=E^K-«t-i)(f/f + f/t^i) (21) 



where 



C/^ = V W^^'^ dlogga(^t ) 
* * da 



(22) 



We term these estimators SMC2-PS and SMC3-PS in the followings. The combination of 
SMC and path sampling is somewhat natural and has been proposed before, e.g., Johansen 
et al. (2006) although not there in a Bayesian context. Despite the good performance ob- 
served in the setting of rare event simulation, the estimation of normalizing constants by 
this approach seems to have received little attention in the literature. We suspect that this is 
because of widespread acceptance of the suggestion of Del Moral et al. (2006b), that SMC 
doesn't outperform AIS when normalizing constants are the object of inference or that of 
Calderhead and Girolami (2009) that all simulation-based estimators based around path 
sampUng can be expected to behave similarly. We will demonstrate below that these obser- 
vations, whilst true in certain contexts, do not hold in full generality. 



3.3 Extensions and Refinements 

3.3.1 Improved Univariate Numerical Integration 

As seen in the last section, the path sampling estimator requires evaluation of the expecta- 
tion, 

Ea 1 

L 

for a e [0, 1], which can be approximated by importance sampling using samples generated 
by a SMC sampler operating on the sequence of distributions {nt = Pat = Qat/Zt}tLo directly 
for a e {af}^o- For arbitrary a G [0, 1], finding t such that a G (at_i, at), the expectation 
can be easily approximated using existing SMC samples — the quantities required in the 
importance weights to obtain such an estimate have already been calculated during the 
running of the SMC algorithm and such computations have little computational cost. 

As noted by Friel et al. (2012) we can use more sophisticated numerical integration 
strategies to reduce the path sampling estimator bias. For example, higher order Newton- 
Cotes rules rather than the Trapezoidal rule can be implemented straightforwardly. In the 
case of SMC it is especially straightforward to estimate the required expectations at arbitrary 
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a and so we cheaply use higher order integration schemes and we can also use numerical 
integrations which make use of a finer mesh {a[}J^Q than {at}^o- Since higher order 
numerical integrations based on approximations of derivatives obtained from Monte Carlo 
methods may potentially be unstable in some situations, the second approach can be more 
appealing in some applications. A demonstration of the bias reduction effect is provided in 
Section 4.3. 



3.3.2 Adaptive Specification of Distributions 

In settings in which the importance weights at time t depend only upon the sample at 
time t —1, such as that considered here, it is relatively straightforward to consider sample- 
dependent, adapative specification of the sequence of distributions (typically by choosing 
the value of a tempering parameter, such as at based upon the current sample). Jasra et al. 
(2010) proposed such a method of adaptive placing the distributions in SMC algorithms 
based on controlling the rate at which the effective sample size (ESS; Kong et al. (1994)) 
falls. With very little computation cost, this provides an automatic method of specif5dng 
a tempering schedule in such a way that the ESS decays in a regular fashion. Schafer and 
Chopin (2013, Algorithm 2) used a similar technique but by moving the particle system only 
when it resamples they are in a setting which would be equivalent to resampling at every 
timestep (with longer time steps, followed by multiple applications of the MCMC kernel) in 
our formulation. We advocate resampling only adaptively when ESS is smaller than certain 
preset threshold, and here we propose a more general adaptive scheme for the selection 
of the sequence of distributions which has significantly better properties when adaptive 
resampling is employed. 

The ESS was designed to assess the loss of efficiency arising from the use a simple 
weighted sample (rather than a simple random sample from the distribution of interest) 
in the computation of expectations. It's obtained by considering a sample approximation of 
a low order Taylor expansion of the variance of the importance sampling estimator of an 
arbitrary test function to that of the simple Monte Carlo estimator; the test function itself 
vanishes from the expression as a consequence of this low order expansion. 

In our context, allowing to denote the normalized weights of particle i at the end 

of time t — 1, and to denote the unnormalized incremental weights of particle i during 
iteration t the ESS calculated using the current weight of each particle is simply: 



ESS. 



2-^ I v^-'V TiAk) (k) 



(23) 



It's clearly appropriate to use this quantity (which corresponds to the coefficient of variation 
of the current normalized importance weights) to assess weight degeneracy and to make 
decisions about appropriate resampling times (cf. Del Moral et al. (2012)) but it is rather 
less apparent that it's the correct quantity to consider when adaptively specifying a sequence 
of distributions in an SMC sampler. 

The ESS of the current sample weights tells us about the accumulated mismatch be- 
tween proposal and target distributions (on an extended space including the full trajectory 
of the sample paths) since the last resampling time. Fixing either the relative or absolute 
reduction in ESS between successive distributions does not lead to a common discrepancy 
between successive distributions unless resampling is conducted after every iteration as will 
be demonstrated below. 

When specifying a sequence of distributions it is natural to aim for a similar discrepancy 
between each pair of successive distributions. In the context of effective sample size, the 
natural question to ask is consequently, how large can we make «( - q:/_i whilst ensuring 
that TTt remains sufficiently similar to irt-i. One way to measure the discrepancy would be 
to consider how good an importance sampling proposal irt-i would be for the estimation of 
expectations under ttj and a natural way to measure this is via the sample approximation of 
a Taylor expansion of the relative variance of such an estimator exacdy as in the ESS. 

Such a procedure leads us to a quantity which we have termed the conditional ESS 
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Evolution of distributions using adaptive schedules 
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Figure 1: A typical plot of at — at-i against at (for the Gaussian mixture model example of 
Section 4.1 using the SMC2 algorithm). The specifications of the adaptive parameter (ESS or 
CESS) are adjusted such that all four samplers use roughly the same number of distributions. 



(CESS): 



CESS, = 




(24) 



which is equal to the ESS only when resampling is conducted during every iteration. The 
factor of 1/N in the denominator arises from the fact that is normalized to sum to 

unity rather than to have expectation unity: the bracketed term coincides with a sample 
approximation (using the actual sample which is properly weighted to target iit-i) of the 
expected sum of the unnormalized weights squared divided by the square of a sample ap- 
proximation of the expected sum of unnormalized weights when considering sampling from 
7rt_i and targeting TTf by simple importance sampling. 

Figure 1 shows the variation of at — at-i with at when fixed reductions in ESS and 
CESS are used to specify the sequence of distributions both when resampling is conducted 
during every iteration (or equivalently when the ESS falls below a threshold of 1.0) and 
when resampling is conducted only when the ESS falls below a threshold of 0.5. As is 
demonstrated in Section 4 the CESS-based scheme leads to a reduction in estimator variance 
of around 20% relative to a manually tuned (quadratic; see Section 4.1.1) schedule while 
the ESS-based strategy provides little improvement over the linear case unless resampling is 
conducted during every iteration. 

In addition to providing a significantly better performance at essentially no cost, the 
use of the CESS emphasizes the purpose of the adaptive specification of the sequence of 
distributions: to produce a sequence in which the difference between each successive pair is 
the same (when using the CESS one is seeking to ensure that the variance of the importance 
weights one would arrive at if using -Kt-i as a proposal for ttj is constant). 



3.3.3 Adaptive Specification of Proposals 

The SMC sampler is remarkably robust to the mixing speed of MCMC kernels employed 
as can be seen in the empirical study below. However, as with any sampling algorithms, 
faster mixing doesn't harm performance and in some cases will considerably improve it. 
In the particular case of Metropolis-Hastings kernels, the mixing speed relies on adequate 
proposal scales. 
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We adopt a simpler approach based on Jasra et al. (2010). They appHed an idea used 
within adaptive MCMC methods (Andrieu and Moulines, 2006) to SMC samplers by using 
variance of parameters estimated from its particle system approximation as the proposal 
scale for the next iteration, suitably scaled with reference to the dimension of the parameters 
to be proposed. Although, in practice we found that such an automatic approach does not 
always lead to optimal acceptance rates it generally produces satisfactory results and is 
simple to implement. In difficult problems alternative approaches to adaptation could be 
employed; one approach demonstrated in Jasra et al. (2010) is to simply employ a pair of 
acceptance rate thresholds and to alter the proposal scale from the simply estimated value 
whenever the acceptance rate falls outside those threshold values. 

More sohisticated proposal strategies could undoubtedly improve performance further 
and warrant further investigation. One possible approach is using the Metropolis adjusted 
Langevin algorithm (MALA; see Roberts and Tweedie (1996)). In summary, MALA derives 
a Metropolis-Hastings proposal kernel for a target tt which satisfies suitable differentiability 
and positivity conditions, from the Langevin diffusion. 



where Bi is the standard Brownian motion. Given a state X„_i, a new state is proposed by 
discrete approximation to the above diffusion. That is, for a fixed h> 0, 



where Id is the identify matrix and d is the dimension of the state space. The new proposed 
state is accepted or rejected through the usual Metropolis-Hastings algorithm. Compared 
to a "vanilla" random walk, which often has very robust theoretical properties, MALA is 
attractive when it is possible and its convergence conditions (Roberts and Tweedie, 1996) 
can be met, because only one discrete approximation parameter h needs to be tuned for 
optimal performance. In addition, results from Roberts and Rosenthal (2001) suggested that 
MALA can be more efficient than a random walk when using optimal scalings. We could also 
use the particle approximation at time index t = n—lto estimate the covariance matrix of 7r„ 
and thus tune the scale h on-line. As these algorithms are known to be somewhat sensitive 
to scaUng, and we seek approaches robust enough to employ with little user intervention, 
we have not investigated this strategy here. 

3.4 An Automatic, Generic Algorithm 

With the above refinements, we are ready to implement the SMC2 algorithm with min- 
imal tuning and application-specific effort while providing robust and accurate estimates 
of the model evidence p{y\Mi.). First the geometric annealing scheme that connects the 
prior 7r(^fe|Mfe) and the posterior w{Ok\y,Mk), provides a smooth path for a wide range of 
problems. 

Second, the actual annealing schedule under this scheme can be determined through 
the adaptive schedule as described above. The advantage of the adaptive schedule will be 
shown empirically later 

Third, we can adaptively specify the Metropolis random walk (or MALA) scales through 
the estimation of their scaling parameters as the sampler iterates. In contrast to the MCMC 
setting, where such adaptive algorithms will usually require a burn-in period, which will 
not be used for further estimation, in SMC, the variance and covariance estimates come 
at almost no cost, as all the samples will later be used for marginal likelihood estimation. 
Additionally, adaptation within SMC does not require separate theoretical justification - 
something which can significantly complicate the development of effective, theoretically 
justified schemes in the MCMC setting. Alternatively, we can also specify the proposal scales 
in a deterministic, but sensible way. Since SMC algorithms are relatively robust to the change 
of scales, such deterministic scales will not require the same degree of tuning as is required 
to obtain good performance in MCMC algorithms. 

The adaptive strategy can be applied to both algorithms directly. The applicability to 
the SMC3 algorithm depends on the nature of the sequence of distributions. We outline the 
strategy in Algorithm 4. 





(25) 
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Algorithm 4 An Automatic, Generic Algorithm for Bayesian Model Comparison 

Accuracy control 

Set constant CESS* G (0, 1), using a small pilot simulation if necessary. 
Initialization: Set t <— Q. 

Perform the Initialization step as in Algorithm 1 or 2 
Iteration: Set t ^ t + 1 
Step size selection 

Use a binary search to find a* such that CESSq* = CESS* 
Set at •(- a* if a* < 1, otherwise set 1 
Proposal scale calibration 

Computing the importance sampling estimates of first two moments of parameters. 
Set the proposal scale of the Markov proposal Kt with the estimated parameter vari- 
ances. 

Perform the Iteration step as in Algorithm 1 or 2 with the found at and proposal scales. 
Repeat the Iteration step until at = 1 then set T = t. 



As laid out above, the algorithms require minimal tuning. Its robustness, accuracy and 
efficiency will be shown empirically in Section 4. We shall also note that SMCl is less 
straightforward as the between model moves still require effort to design and implement. 
In SMC3, the specification of the sequences between posterior distributions are less generic 
compared to the geometric annealing scheme in SMC2. However, the adaptive schedule and 
automatic tuning of MCMC proposal scales, both can be applied in these two algorithms in 
principal. 

Although further enhancements and refinements are clearly possible, we focus in the 
remainder of this article on this simple, generic algorithm which can be easily implemented 
in any application and has proved sufficiently powerful to provide good estimation in the 
examples we have encountered thus far. 

4 Illustrative Applications 

In this section, we will use three examples to illustrate the algorithms. The Gaussian mixture 
model is discussed first, with implementations for all three SMC algorithms with comparison 
to RJMCMC and PMCMC. It will be shown that all five algorithms agree on the results 
while the performance in terms of Monte Carlo variance varies considerably. It will also be 
demonstrated how the adaptive refinements of the algorithms behaves in practice. We will 
reach the conclusion that considering ease of implementation, performance and generality, 
the SMC2 algorithm is most promising among all three strategies. 

Then two more realistic examples, a nonlinear ODE model and a Positron Emission To- 
mography compartmental model are used to study the performance and robustness of al- 
gorithm SMC2 compared to AIS and PMCMC. Various configurations of the algorithms are 
considered including both sequential and parallelized implementations. 

The C++ implementations of all examples can be found at https : //github . com/zhouyan/ 
vSMC. 

4.1 Gaussian Mixture Model 

Since Richardson and Green (1997), the Gaussian mixture model (GMM) has provided a 
canonical example of a model-order-determination problem. We use the model formula- 
tion of Del Moral et al. (2006b) to illustrate the efficiency and robustness of the meth- 
ods proposed in this paper compared to other approaches. The model is as follows; data 
y = (yi, . . . , y„) are independendy and identically distributed as 

r 
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where J\f{iJ,j,X~^) denotes the Normal distribution with mean fXj and precision A^; Or = 
{fj-i-.r, Ai:r, wi:r) and r is the number of components in each model. The parameter space is 
thus X (R'^Y X Ar where = {oji-.r : < < 1; J2]=i = 1} is the standard r-simplex. 
The priors which are the same for each component are taken to be fij ~ A/'(^, k,~^), Xj ^ 
x) and uji.r ^ 'D{p) where T>{p) is the sjTnmetric Dirichlet distribution with parameter 
p and Q{ij, \) is the Gamma distribution with shape v and scale x- The prior parameters are 
set in the same manner as in Richardson and Green (1997). Specifically, let j/min and j/max 
be the minimum and maximum of data y, the prior parameters are set such that 

C = (2/max + 2/inin)/2, K = (^max " ?/min)"^, V = 2, % = 50k, p=l 

The data is simulated from a four components model with ^.i-i = (—3, 0, 3, 6), and \j = 2, 
a;,- = 0.25, J = 1,..., 4. 

We consider several algorithms. First the RJMCMC algorithm as in Richardson and Green 
(1997), and second an implementation of the SMCl algorithm. Next AIS, PMCMC and 
SMC2 are used for within-model simulations. The last is an implementation of the SMC3 
algorithm. In all the algorithms, the local move which does not change the dimension of the 
model is constructed as a composition of Metropolis-Hastings random walk kernels: 

• Update pi,r using a multivariate Normal random walk proposal. 

• Update \i;r using a multivariate Normal random walk on logarithmic scale, i.e., on 
logAj, i = l,...,r. 

• Update oji-r using a multivariate Normal random walk on logit scale, i.e., on ujj/ur, 
j = l,...,r-l. 

The RJMCMC, SMCl and SMC3 algorithms use two additional reversible jump moves. The 
first is a combine and split move; the second is a birth and death move. Both are constructed 
in the same manner as in Richardson and Green (1997). Also in these implementations, an 
adjacency condition was imposed on the means i^i-.r, such that /xi < /Lt2 < • • • < A*r- No such 
restriction was used for other algorithms. 

In SMCl, SMC2 and PMCMC implementations, the distributions are chosen with a ge- 
ometric schedule, i.e., as in Equation (15) for SMCl and Equation (16) for the other two. 
This annealing scheme has been used in Del Moral et al. (2006b); Jasra et al. (2007a) and 
many other works. The geometric scheme can also be seen in Calderhead and Girolami 
(2009) for PMCMC tempering. A schedule a{t/T) = {t/T)P, with p = 2 was used. The 
rationale behind this particular schedule can be seen in Calderhead and Girolami (2009) 
and other values of p were also tried while p « 2 performs best in this particular example. 
The adaptive schedule was also implemented for SMC2 and AIS algorithms. 

The proposal scales for each block of the random walks are specified dynamically ac- 
cording to values of a{t/T) for the SMC2 and AIS algorithms and also manually tuned for 
other algorithms such that the acceptance rates fall in [0.2, 0.5]. Later for the SMC2 and AIS 
algorithms, we also consider adaptive schedule of the distribution specification parameter 
a{t/T) and the proposal scales of the random walks. 

For SMC2, SMC3 and AIS we consider both the direct estimator and the path sampling 
estimator. For PMCMC we consider the path sampUng estimator. 

4.1.1 Results 

The SMCl implementation uses 10^ particles and 500 distributions. The RJMCMC imple- 
mentation uses 5 X 10^ iterations in addition to 10^ iterations of bum-in period. The resulting 
estimates of model probabilities are shown in Table 2. 

The SMC2, SMC3 and AIS implementations use 1, 000 particles and 500 iterations. The 
PMCMC implementation uses 50 chains and 10* iterations in addition to 10* iterations of 
burn-in period — these implementations have approximately equal computational costs. 
From the results obtained under the SMCl and RJMCMC algorithms it is clear that, in this 
particular example, simulations for models with fewer than ten components are adequate 
to characterize the model space. Therefore, under this configuration, the cost is roughly the 
same in terms of computational resources as that of the SMCl and RJMCMC algorithms. 
From the results of RJMCMC and SMCl, we consider four and five components models (i.e., 
the true model and the most competitive amongst the others). The estimates are shown 
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Number of components 






Algorithm Quantity 


< 2 


3 4 5 6 


7 


> 8 


SMCl P(M = Mk) 





0.00257 0.886 0.103 0.00715 


0.00128 





log B^^k 


oo 


5.84 2.15 4.82 


6.54 


oo 


RJMCMC P(M = Mk) 





0.000526 0.887 0.103 0.00623 


0.00324 





log B^^k 


oo 


6.56 2.15 4.96 


5.61 


oo 


Table 2: Gaussian mixture model estimates obtained via SMCl and RJMCMC 


Algorithms 


Quamity SMC2-DS SMC2-PS 


SMC3-DS 


SMC3-PS AIS-DS AIS-PS 


PMCMC 


logB4,5 2.15 2.15 


2.16 


2.21 2.16 2.17 


2.63 




SD 0.25 0.22 


0.61 


0.62 1.12 1.10 


0.41 





Table 3: Gaussian mixture model estimates obtained via SMC2, SMC3, AIS and PMCMC 



in Table 3 which, like all of the other tables in this section, summarises the Monte Carlo 
variability of 100 replicate runs of each algorithm.. 

From Tables 2 and 3, it can be seen that the unbiased estimators (RJMCMC, SMCl, 
SMC2-DS, SMC3-DS and AIS-DS) agree with each other Among the path sampling estima- 
tors, SMC2-PS and AIS-PS have little bias. SMC3-PS shows a little more bias. The PMCMC 
algorithm has a considerable larger bias as the number of distributions is relatively small (as 
noted previously, a larger number will negatively affect the mixing speed) . 

In terms of Monte Carlo variance, in Table 3, SMC2 clearly has an advantage compared 
to its no-resampling variant, AIS. The differences of Monte Carlo SD between SMC2, SMC3 
and PMCMC, although they do not affect model selection in this particular example, are 
considerable. 

Effects of resampling It is clear from these results that resampling (when required) can 
substantially improve the estimation of normalising constants within an SMC framework. 
This doesn't contradict the statement in Del Moral et al. (2006b) which suggests that re- 
sampling may not much help when the normalising constant is the object of interest: the 
theoretical argument which supports this relies upon the assumption that the Markov kernel 
used to mutate the particles mixes extremely rapidly and the result is obtained under the 
assumption that resampling is performed after every iteration. When the Markov kernel is 
not so rapidly mixing, the additional stability provided by the resampling operation can out- 
weight the attendant increase in Monte Carlo variance and that is what we observed here 
(and in the case of the other examples considered below; results not shown) . 

Effects of adaptive schedules To assess the evolution of distributions with an adaptive 
schedule, we consider the relation between at — at-i and at- As stated before, one of the 
motivations of using CESS for adaptive placement of distribution is to ensure that at evolves 
the same path regardless the resampling strategies. Figure 1 shows the evolution of at when 
fixing ESS or CESS and resampling every iteration or only when ESS < A^/2. As shown in the 
plot, when fixing CESS, the evolution of the distributions is not affected by the resampling 
strategy. In contrast, fixing ESS yields a sequence of distributions which depends strongly 
upon the resampling strategy. 

In terms of the actual performance when using the CESS adaptive strategy in the SMC2 
and AIS algorithms, a reduction of standard deviation of 20% was observed comparing to 
a{t/T) = {t/Tf, the one shown in Table 3. When applied to the SMC3 algorithm, 50% 
reduction was observed. If the ESS adaptive strategy is used instead, similar standard devi- 
ation reduction is observed when resampling is performed every iteration but no significant 
effect was observed when resampling was only performed when ESS < N/2 (i.e., using ESS 
rather than CESS entirely eliminated the benefit). 
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Effects of adaptive proposal scales When using the SMC2 algorithm, if the adaptive 
strategy of Andrieu and Mouhnes (2006) is apphed, where the important samphng estimates 
of the variance of parameters are included in the adaptation, the acceptance rates fall within 
[0.2, 0.5] dynamically without manual tuning as for the results in Table 3. It should be 
noted that in this particular example, it is the variance of logAj being estimated as the 
corresponding random walk block operates on the log scale. The same principle applies 
to the weight parameters, which have random walks on logit scale. Approximately 20% 
reduction in standard deviation was observed. 



4.2 Nonlinear Ordinary Differential Equations 

The example from the previous section suggests that SMC2 performs well relative to the 
other SMC possibiUties. Given the wide range of settings in which it can be easily deployed, 
we will now concentrate further on this method. It also suggests that in the simple case 
of Gaussian mixtures, a complete adaptive strategy for both distribution specification and 
proposal scales works well. In this section, this will now be further explored in a more 
complex model, a nonlinear ordinary differential equations system. This model, which was 
studied in Calderhead and Girolami (2009), is known as the Goodwin model. The ODE 
system, for an m-component model, is: 

^^^^ = ki-iXi_i{t) - aXi{t) i = 2,...,m 

Xi{0) = i = l,...,m 

The parameters {a, ai, 02, ki:m-i} have common prior distribution 5(0.1, 0.1). Under this 
setting, Xi-mit) can exhibit either unstable oscillation or a constant steady state. The data 
are simulated for m = {3, 5} at equally spaced time points from to 60, with time step 0.5. 
The last 80 data points of {Xi{t),X2{t)) are used for inference. Normally-distributed noise 
with standard deviation a ~ 0.2 is added to the simulated data. Following Calderhead and 
Girolami (2009), the variance of the additive measurement error is assumed to be known. 
Therefore, the posterior distribution has m + 2 parameters for an m-component model. 

As shown in Calderhead and Girolami (2009), when p > 8, due to the possible instability 
of the ODE system, the posterior can have a considerable number of local modes. In this 
example, we set p = 10. Also, as the solution to the ODE system is somewhat unstable, 
slightly different data can result in very different posterior distributions. 



4.2.1 Results 

We compare results from the SMC2 and PMCMC algorithms. For the SMC implementation, 
1,000 particles and 500 iterations were used, with the distributions specified specified by 
Equation (16), with a{t/T) = (t/T)^, or via the completely adaptive specification. For 
the PMCMC algorithm, 50, 000 iterations are performed for burn-in and another 10, 000 
iterations are used for inference. The same tempering as was used for SMC is used here. 
Note that, in a sequential implementation of PMCMC, with each iteration updating one local 
chain and attempting a global exchange, the computational cost of after burn-in iterations 
is roughly the same as the entire SMC algorithm. In addition, changing T within the range 
of the number of cores available does not substantially change the computational cost of a 
generic parallel implementation of the PMCMC algorithm. We compare results from T = 
10,30,100. 

The results for data generated from the simple model (m = 3) and complex model 
(m = 5), again summarising variability amongst 100 runs of each algorithm, are shown in 
Table 4 and 5, respectively. 

As shown in both cases, the number of distributions can affect the performance of PM- 
CMC algorithms considerably. When using 10 distributions, large bias from numerical inte- 
gration for path samphng estimator was observed, as expected. With 30 distributions, the 
performance is comparable to the SMC2 sampler, though some bias is still observable. With 
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Marginal likelihood 

(logp(0fc|y)±SD) 



T 


Proposal 


Annealing 


Algorithm 


m = 3 


m = 5 


Bayes factor 




Scales 


Scheme 








log -83^5 


10 


Manual 


Prior (5) 


PMCMC 


-109.7 ±3.2 


-120.3 ±2.5 


10.6 ±3.8 


30 








-105.0 ±1.2 


-116.1 ± 2.2 


11.2 ± 2.5 


100 








-134.7 ±7.9 


-144.1 ±6.2 


9.4 ± 11.2 


500 


Manual 


Prior (5) 


SMC2-DS 


-104.6 ±2.0 


-112. 7± 1.8 


8.1 ±2.8 








SMC2-PS 


-104.5 ± 1.8 


-112. 7± 1.5 


8.2 ±2.5 


500 


Manual 


Adaptive 


SMC2-DS 


-104.5 ± 1.1 


-112. 7± 1.1 


8.1 ± 1.6 








SMC2-PS 


-104.6 ± 1.0 


-112.8 ± 1.0 


8.2 ± 1.5 


500 


Adaptive 


Adaptive 


SMC2-DS 


-104.5 ±0.5 


-112.7 ±0.4 


8.1 ±0.8 








SMC2-PS 


104.6 ±0.4 


-112.8 ±0.3 


8.1 ± 0.6 



Table 4: Results for non-linear ODE models with data generated from simple model, italic: 
Minimum variance for the same algorithm. Bold: Minimum variance for all samplers. 



Marginal likelihood 

(logp(^fc|y)±SD) 



T 


Proposal 


Annealing 


Algorithm 


771 = 3 


TO = 5 


Bayes factor 




Scales 


Scheme 








log -85^3 


10 


Manual 


Prior (5) 


PMCMC 


-1651.0 ±27.9 


-85.1 ±36.6 


1565.9 ±42.1 


30 








-1639. 7±7. 4 


-78.9 ± 11.2 


1560.8 ± 12.8 


100 








-1624.6 ±15.7 


-75.7 ±24.8 


1548.9 ±25.6 


500 


Manual 


Prior (5) 


SMC2-DS 


-1640.7 ±10.8 


-78.5 ±9.8 


1562.2 ± 10.1 








SMC2-PS 


-1640.8 ±8.4 


-79.2 ± 7.9 


1561.6 ±8.5 


500 


Manual 


Adaptive 


SMC2-DS 


-1639.7 ±6.9 


-78.6 ±4.8 


1561.1 ± 7.1 








SMC2-PS 


-1640.1 ±5.4 


-78.8 ±3.7 


1561.3 ±6.8 


500 


Adaptive 


Adaptive 


SMC2-DS 


-1639.8 ±2.2 


-79.4 ± 1.7 


1560.4 ±3.1 








SMC2-PS 


1640.2 ± 1.9 


-78.5 ± 1.5 


1561. 7± 2.3 



Table 5: Results for non-linear ODE models with data generated from complex model. Number 
italic: Minimum variance for the same algorithm. Bold: Minimum variance for all samplers. 
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100 distributions, there is a much larger variance because, with more chains, the informa- 
tion travels more slowly from rapidly mixing chains to slowly mixing ones and consequently 
the mixing of the overall system is inhibited. 

The SMC algorithm provides results comparable to the best of three PMCMC implemen- 
tations in essentially all settings, including one in which both the annealing schedule and 
proposal scaling were fully automatic. In fact, the completely adaptive strategy was the 
most successful. 

It can be seen that increasing the number of distributions not only reduces the bias of 
numerical integration for path sampling estimator, but also reduces the variance consider- 
ably. On the other hand increasing the number of particles can only reduce the variance of 
the estimates, in accordance with the central limit theorem (see Del Moral et al. (2006b) for 
the standard estimator and extensions for path sampUng estimator. Proposition 1) (as the 
bias arises from the numerical integration scheme) . 

4.3 Positron Emission Tomography Compartmental Model 

It is now interesting to compare the proposed algorithm with other state-of-art algorithms 
using a more realistic example. 

Positron Emission Tomography (PET) is a technique used for studjdng the brain in vivo, 
most typically when investigating metabolism or neuro-chemical concentrations in either 
normal or patient groups. Given the nature and number of observations typically recorded 
in time, PET data is usually modeled with linear differential equation systems. For an 
overview of PET compartmental model see Gunn et al. (2002). Given data (yi, . . .,ynV, 
an m-compartmental model has generative form: 



yj = CT{tj\(t>l:m,6l:m) + \ 7 7 (26) 

CT{tj■Al..m,el■.m)=y2^^ Cp{s)e-'>'^*^-'^ ds (27) 

where tj is the measurement time of yj, ej is additive measurement error and input func- 
tion Cp is (treated as) known. The parameters (j)i,9i, . . . ,(pm,Om characterize the model 
d5Tiamics. See Zhou et al. (2013) for applications of Bayesian model comparison for this 
class of models and details of the specification of the measurement error. In the simulation 
results below, ej are independently and identically distributed according to a zero mean 
Normal distribution of unknown variance, ct^, which was included in the vector of model 
parameters. 

Real neuroscience data sets involve a very large number (~ 200, 000 per brain) of time 
series, which are tj^ically somewhat heterogeneous. Figure 2 shows estimates of Vd = 
J2]Li 4>j/(^j froni a tj^ical PET scan (generated using SMC2 as will be discussed later). 
Robustness is therefore especially important. An application-specific MCMC algorithm was 
developed for this problem in Zhou et al. (2013). A significant amount of tuning of the 
algorithms was required to obtain good results. The results shown in Figure 2 are very close 
to those of Zhou et al. (2013) but, as is shown later, they were obtained with almost no 
manual tuning effort and at similar computational cost. 

For SMC and PMCMC algorithms, the requirement of robustness means that the algo- 
rithm must be able to calibrate itself automatically to different data (and thus different 
posterior surfaces). A sequence of distributions which performs well for one time series 
may not perform even adequately for another series. Specification of proposal scales that 
produces fast-mixing kernels for one data series may lead to slow mixing for another. In 
the following experiment, we will use a single simulated time series, and choose schedules 
that performs both well and poorly for this particular time series. The objective is to see if 
the algorithm can recover from a relatively poorly specified schedule and obtain reasonably 
accurate results. 
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Volume Distribution of Typical PET Data 




Figure 2: Estimates of Vd from a single PET scan as found using SMC2. The data shows that 
the volume of distribution exhibits substantial spatial variation. Note that each pixel in the 
image represent an estimate from an individual time series data set. There are approximately a 
quarter million of them and each requires a Monte Carlo simulation to select a model. 



4.3.1 Results 

In this example we focus on the comparison between SMC2 and PMCMC. We also consider 
parallelized implementations of algorithms. In this case, due to its relatively small number of 
chains, PMCMC can be parallelized completely (and often cannot fully utilize the hardware 
capability if a naive approach to parallelization is taken; while we appreciate that more 
sophisticated parallelization strategies are possible, these depend instrinsicially upon the 
model under investigation and the hardware employed and given our focus on automatic 
and general algorithms, we don't consider such strategies here). The PMCMC algorithm 
under this setting is implemented such that each chain is updated at each iteration. Further, 
for the SMC algorithms, we consider two cases. In the first we can parallelize the algorithm 
completely (in the sense that each core has a single particle associated with it). In this 
setting we use a relatively small number of particles and a larger number of time steps. 
In the second, we need a few passes to process a large number of particles at each time 
step, and accordingly we use fewer time steps to maintain the same total computation time. 
These two settings allow us to investigate the trade-off between the number of particles and 
time steps. In both implementations, we consider three schedules, a{t/T) = t/T (linear), 
a{t/T) = (t/Tf (prior), and a(i/T) = 1 - (1 - t/Tf (posterior). In addition, the adaptive 
schedule based upon CESS is also implemented for the SMC2 algorithm. 

Results from 100 replicate runs of the two algorithms under various regimes can be 
found in Table 6 and 7 for the marginal likelihood and Bayes factor estimates, respectively. 
The SMC algorithms consistently outperforms the PMCMC algorithms in the parallel set- 
tings. The Monte Carlo SD of SMC algorithms is typically of the order of one fifth of the 
corresponding estimates from PMCMC in most scenarios. In some settings with the smaller 
number of samples, the two algorithms can be comparable. Also at the lowest computa- 
tional costs, the samplers with more time steps and fewer particles outperform those with 
the converse configuration by a fairly large margin in terms of estimator variance. It shows 
that with limited resources, ensuring the similarity of consecutive distributions, and thus 
good mixing, can be more beneficial than a larger number of particles. However, when the 
computational budget is increased, the difference becomes negligible. The robustness of 
SMC to the change of schedules is again apparent. 

Effects of adaptive schedule A set of samplers with adaptive schedules are also used. 
Due to the nature of the schedule, it cannot be controlled to have exactly the same number 
of time steps as non-adaptive procedures. However, the CESS was controlled such that the 
average number of time steps are comparable with the fixed schedules and in most cases 
slightly less than the fixed numbers. 
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Proposal scales 



Manual 



Adaptive 



Annealing scheme Prior (5) Posterior (5) Adaptive 



T N Algorithm Marginal likelihood estimates (logp(6'fc|y) ± SD) 



500 


30 


PMCMC 


-39.1 ±0.56 


-926.8 ±376.99 






500 


192 


SMC2-DS 


-39.2 ± 0.25 


-39.7 ±1.06 


-39.2 ± 0.18 


39.1 ± 0.12 






SMC2-PS 


-39.2 ± 0.25 


-91.3 ±21.69 


-39.2 ± 0.18 


-39.1 ±0.13 


100 


960 


SMC2-DS 


-39.3 ±0.36 


-40.6 ± 1.41 


-39.2 ±0.31 


-39.2 ±0.19 






SMC2-PS 


-39.3 ±0.35 


302.1 ±46.29 


-39.3 ±0.31 


-39.2 ±0.18 


1000 


30 


PMCMC 


-39.3 ±0.46 


-884.1 ±307.88 






1000 


192 


SMC2-DS 


-39.2 ± 0.19 


-39.4 ± 0.68 


-39.2 ± 0.17 


39.1 ± 0.10 






SMC2-PS 


-39.2 ± 0.19 


-66.0 ± 13.26 


-39.2 ± 0.17 


39.1 ± 0.10 


200 


960 


SMC2-DS 


-39.2 ±0.22 


-39.8 ± 1.21 


-39.2 ±0.18 


-39.1 ±0.11 






SMC2-PS 


-39.2 ±0.22 


175.5 ±26.84 


-39.2 ±0.18 


-39.2 ±0.11 


2000 


30 


PMCMC 


-39.3 ±0.28 


-928.7 ± 204.93 






2000 


192 


SMC2-DS 


-39.2 ±0.14 


-39.3 ±0.41 


-39.1 ±0.12 


-39.1 ±0.07 






SMC2-PS 


-39.2 ±0.14 


-51.2 ±4.30 


-39.2 ±0.12 


-39.1 ±0.07 


400 


960 


SMC2-DS 


-39.2 ± 0.13 


-39.4 ±0.73 


-39.2 ±0.11 


-39.2 ±0.07 






SMC2-PS 


-39.2 ± 0.13 


106.0 ± 14.36 


-39.2 ± 0.11 


39.2 ±0.06 


5000 


30 


PMCMC 


-39.3 ±0.21 


-917.6 ± 129.54 






5000 


192 


SMC2-DS 


-39.2 ±0.09 


-39.2 ± 0.20 


-39.2 ±0.08 


-39.1 ±0.04 






SMC2-PS 


-39.2 ±0.09 


-43.8 ±2.13 


-39.2 ±0.08 


-39.1 ±0.04 


1000 


960 


SMC2-DS 


-39.2 ± 0.08 


-39.2 ±0.31 


-39.2 ± 0.07 


39.2 ±0.03 






SMC2-PS 


-39.2 ± 0.08 


-65.7 ±5.54 


-39.2 ± 0.07 


39.2 ± 0.03 



Table 6: Marginal likelihood estimates of two components PET model. T: Number of distribu- 
tions in SMC and number of iterations used for inference in PMCMC. : Number of particles in 
SMC and number chains in PMCMC. The PMCMC and SMC with N = 192 are completely A^-way 
parallelized. SMC with = 960 are A^/5-way parallelized. Italic: Minimum variance for the 
same computational cost and the same proposal scales and annealing schemes. Bold: Minimum 
variance for the same computaitonal cost and all proposal scales and annealing schemes. 
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Proposal scales Manual Adaptive 



Annealing scheme 


Prior i^b ) 


Posterior (5) 


Adaptive 


± 


N 

1 V 


A 1 (Toi"i1"h m 
Ul ILillll 




Bayes factor estimates (logi32,i ± SD) 




r An 

500 


oO 


PMLMC 


1. < m u.oz 


-70.9 ± 525.79 






500 


192 


SMC2-DS 


1.6 ±0.27 


1.3 ± 1.13 


1.6±0.20 


1.6 ± 0.15 






SMC2-PS 


1.6±0.27 


-3.9 ±30.02 


1.6±0.20 


1.6 ± 0.15 


100 


960 


SMC2-DS 


1.6 ±0.37 


0.5 ± 1.55 


1.6 ±0.34 


1.6 ±0.21 






SMC2-PS 


1.6 ±0.37 


-13.1 ±66.30 


1.6 ±0.33 


1.6 ±0.21 


1000 


30 


PMCMC 


1.6 ± 0.49 


-67.3 ±400.21 






1000 


192 


SMC2-DS 


1.6±0.21 


1.5±0.79 


1.6 ±0.20 


1.6 ±0.13 






SMC2-PS 


1.6±0.21 


-0.6 ± 15.47 


1.6 ±0.20 


1.6 ±0.13 


200 


960 


SMC2-DS 


1.6 ±0.25 


1.1 ± 1.25 


1.6 ±0.19 


1.6 ±0.12 






SMC2-PS 


1.6 ±0.24 


-11.7 ±34.68 


1.6± 0.18 


1.6 ± 0.11 


2000 


30 


PMCMC 


1.6 ±0.31 


-95.5 ±264.74 






2000 


192 


SMC2-DS 


1.6± 0.14 


1.6±0.U 


1.6 ±0.13 


1.6 ±0.09 






SMC2-PS 


1.6±0.14 


1.6 ±6.06 


1.6 ±0.13 


1.7 ±0.09 


400 


960 


SMC2-DS 


1.6 ±0.16 


1.5 ±0.74 


1.6± 0.12 


1.6 ± 0.08 






SMC2-PS 


1.6 ± 0.16 


-4.2 ± 17.15 


1.6±0.12 


1.6 ±0.08 


5000 


30 


PMCMC 


1.6 ±0.24 


-60.3 ± 198.10 






5000 


192 


SMC2-DS 


1.6 ±0.10 


1.6±0.23 


1.6 ±0.09 


1.6 ±0.05 






SMC2-PS 


1.6 ±0.10 


1.3 ±2.98 


1.6 ±0.09 


1.6 ±0.05 


1000 


960 


SMC2-DS 


1.6±0.09 


1.6 ±0.33 


1.6±0.08 


1.6 ±0.04 






SMC2-PS 


1.6±0.09 


-0.2 ±6.63 


1.6±0.08 


1.6 ±0.04 



Table 7: Bayes factor i?2,i estimates of two components PET model. T: Number of distributions 
in SMC and number of iterations used for inference in PMCMC. N: Number of particles in 
SMC and number chains in PMCMC. The PMCMC and SMC with N = 192 are completely N- 
way parallelized. SMC with = 960 are A^/5-way parallelized. Italic: Minimum variance for 
the same computational cost and the same schedule. Bold: Minimum variance for the same 
computational cost and all schedules. 
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Integration rule 


Number of grid points (compared to sampled iterations) 


xl 


x2 


x4 


x8 


Trapezoid 


-52.2 ±5.01 


-45.5 ± 1.93 


-42.1 ± 1.21 


-40.5 ± 1.06 


Simpson 


-43.2 ± 1.39 


-41.0 ± 1.10 


-40.0 ± 1.04 


-39.4 ± 1.04 


Simpson 3/8 


-42.1 ± 1.21 


-40.5 ± 1.06 


-39.7 ± 1.04 


-39.3 ± 1.04 


Boole 


-40.9 ± 1.09 


-39.9 ± 1.04 


-39.4 ± 1.04 


-39.2 ± 1.05 



Table 8: Path sampling estimator of marginal likelihood of two components PET model. The 
estimator was approximated using samples from SMC2 algorithm with 1,000 particles and 20 
iterations, with different numerical integration strategies. Large sample result (see Table 6) 
shows that an unbiased estimate is —39.2. 



It is found that, with little computational overhead, adaptive schedules do provide the 
best results (or very nearly so) and do so v^rithout user intervention. The reduction of Monte 
Carlo SD varies among different configurations. For moderate or larger number of distri- 
butions, a reduction about 50% was observed. In addition, it shall be noted that, in this 
example, the bias of the path sampling estimates are much more sensitive to the schedules 
than the previous Gaussian mixture model example. A vanilla linear schedule does not pro- 
vide a low bias estimator at all even when the number of distributions is increased to a 
considerably larger number. The prior schedule though provides a nearly unbiased estima- 
tor, there is no clear theoretical evidence showing that this shall work for other situations. 
The adaptive schedule, without any manual calibration, can provide a nearly unbiased esti- 
mator, even when path-sampling is employed, in addition to potential variance reduction. 

Bias reduction for path sampling estimator As seen in Table 6 and 7, a bad choice of 

schedule a{t/T) can results in considerable bias for the basic path sampling estimator, here 
for SMC2-PS but the problem is independent of the mechanism by which the samples are ob- 
tained. Increasing the number of iterations can reduce this bias but at the cost of additional 
computation time. As outlined in Section 3.3.1, in the case of the SMC algorithms discussed 
here, it is possible to reduce the bias without increasing computational cost significantly. 
To demonstrate the bias reduction effect, we constructed SMC sampler for the above PET 
example with only 1,000 particles and about 20 iterations specified using the CESS based 
adaptive strategy. The path sampling estimator was approximated using Equation (21) as 
well as other higher order numerical integration or by integrating over a grid that contains 
{at} at which the samples was generated. The results are shown in Table 8 

Real data results Finally, the methodology of SMC2-PS was applied to measured positron 

emission tomography data using the same compartmental setup as in the simulations. The 
data shown in Figure 2 comes from a study into opioid receptor density in Epilepsy, with 
the data being described in detail in Jiang et al. (2009). It is expected that there will be 
considerable spatial smoothness to the estimates of the volume of distribution, as this is in 
line with the biology of the system being somewhat regional. Some regions will have much 
higher receptor density while others will be much lower, 5delding higher and lower values 
of the volume of distribution, respectively. While we did not impose any spatial smoothness 
but rather estimated the parameters independently for each time series at each spatial loca- 
tion, as can be seen, smooth spatial estimates of the volume of distribution consistent with 
neurological understanding were found using the approach. This method is computationally 
feasible for the entire brain on a voxel-by-voxel basis, due to the ease of parallelization of 
the SMC algorithm. In the analysis performed here, 1000 particles were used, along with an 
adaptive schedule using a constant CESS* = 0.999, resulting in about 180 to 200 intermedi- 
ate distributions. The model selection results are very close to those obtained by a previous 
study of the same data (Zhou et al., 2013), although the present approach requires much 
less implementation effort and has roughly the same computational cost. 
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4.4 Summary 



These three illustrative applications have essentially shown three aspects of using SMC as 
a generic tool for Bayesian model selection. Firstly, as seen in the Gaussian mixture model 
example, all the different variants of SMC proposed, including both direct and path sampling 
versions, produce results which are competitive with other model selection methods such 
as RJMCMC and PMCMC. In addition, in this somewhat simple example, SMC2 performs 
well, and leads to low variance estimates with no appreciable bias. The effect of adaptation 
was studied more carefully in the nonlinear ODE example, and it was shown that using 
both adaptive selection of distributions as well as adaptive proposal variances leads to very 
competitive algorithms, even against those with significant manual tuning. This suggests 
that an automatic process of model selection using SMC2 is possible. In the final example, 
considering the easy parallelization of algorithms such as SMC2 suggests that great gains 
in variance estimation can be made using settings such as GPU computing for application 
where computational resources are of particular importance (such as in image analysis as 
in the PET example). It is also clear that the negligible cost of the bias reduction techniques 
described means that one should always consider using these to reduce the bias inherent in 
path sampling estimation. 



5 Theoretical Considerations 

The convergence results for the standard estimator can be found in Del Moral et al. (2006b) 

and references therein. In this paper, given our advocation of SMC2-PS, we extend the re- 
sults for the path sampling estimator from SMC samplers. Here we present Proposition 1, 
which is specific to path sampling estimator using the simplest Trapezoidal approach to 
numerical integration. It follows as a simple corollary to a more general result given in Ap- 
pendix A which could be used to characterize more general numerical integration schemes. 

Proposition 1. Under the same regularity conditions as are required for the central limit theo- 
rem given in Del Moral et al. (2006b) to hold, given a SMC sampler that iterates over a sequence 
of distributions {nt = qajZa^}f^Q and applies multinomial resampling at each iteration, the 
path sampling estimator, E^, as defined in Equation (21) obeys a central limit theorem in the 



following sense: Let^t(-) = 

A = {at+i - at-\)l% then, provided 



, /3o = ao/2, Pt = aT/2 and for t e {1, . . . ,T - 1} 

«=«* 

is bounded: 



where Vt, <t <T is defined by the following recursion: 



lim Vn{E^ - Et) A AA(0, V^t(?o-t)) (28) 



Voi^o) =^0 / Mxo){^o{xo) - M^o)fdxo (29) 

Vt{^0:t) =Ft_i('cO:t-2,6-l + /* / Kt{;Xt){Uxt)-7Ttm dxt) (30) 



A-i7rt_i(-) 

/ Kt{xt-i,xt){^t{xt) - M^t)f) dxt-i dxf 



We note that much recent analysis of SMC algorithms has focussed on relaxing the rel- 
atively strong assumptions used in the results upon which this result is based — looking 
at more general resampling schemes (Del Moral et al., 2012) and relaxing compactness 
assumptions (Whiteley, 2013) for example. However, we feel that this simple result is suf- 
ficient to show the relationship between the path sampling and simple estimators and that 
in this instance the relatively simplicity of the resulting expression justifies these stronger 
assumptions. 



6 Discussion 

It has been shown that SMC is an effective Monte Carlo method for Bayesian inference for 
the purpose of model comparison. Three approaches have been outlined and investigated in 
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several illustrative applications including the challenging scenarios of nonlinear ODE models 
and PET compartmental systems. The proposed strategy is always competitive and often 
substantially outperforms the state of the art in this area. 

It has been demonstrated that it is possible to use the SMC algorithms to estimate the 
model probabiUties directly (SMCl), or through individual model evidence (SMC2), or pair- 
wise relative evidence (SMC3). In addition, both SMC2 and SMC3 algorithms can be cou- 
pled with the path sampling estimator 

Among the three approaches, SMCl is appUcable to very general settings. It can pro- 
vide a robust alternative to RJMCMC when inference on a countable collection of models is 
required (and could be readily combined with the approach of Jasra et al. (2008) at the ex- 
pense of a little additional implementation effort) . However, Hke all Monte Carlo methods 
involving between model moves, it can be difficult to design efficient algorithms in prac- 
tice. The SMC3 algorithm is conceptually appealing. However, the existence of a suitable 
sequence of distributions between two posterior distributions may not be obvious. 

The SMC2 algorithm, which only involves within-model simulation, is most straightfor- 
ward to implement in many interesting problems. It has been shown to be exceedingly 
robust in many settings. As it depends largely upon a collection of within-model MCMC 
moves, any existing MCMC algorithms can be reused in the SMC2 framework. However, 
much less tuning is required because the algorithm is fundamentally less sensitive to the 
mixing of the Markov kernel and it is possible to implement effective adaptive strategies 
at little computational cost. With adaptive placement of the intermediate distributions and 
specification of the MCMC kernel proposals, it provides a robust and essentially automatic 
model comparison method. 

Compared to the PMCMC algorithm, SMC2 has greater flexibility in the specification of 
distributions. Unlike PMCMC, where the number and placement of distributions can affect 
the mixing speed and hence performance considerably, increasing the number of distribu- 
tions will always benefit a SMC sampler given the same number of particles. When coupled 
with a path sampling estimator, this leads to less bias and variance. Compared to its no- 
resampling variant, it has been shown that SMC samplers with resampling can reduce the 
variance of normalizing constant estimates considerably. 

Even after three decades of intensive development, no Monte Carlo method can solve the 
Bayesian model comparison problem completely automatically without any manual tuning. 
However, SMC algorithms and the adaptive strategies demonstrated in this paper show that 
even for realistic, interesting problems, these samplers can provide good results with very 
minimal tuning and few design difficulties. For many appUcations, they could already be 
used as near automatic, robust solutions. For more challenge problems, the robustness of 
the algorithms can serve as solid foundation for specific algorithm designs. 



A Proof of Proposition 1 

We begin by making some identifications which allow the connection between the SMC 
sampler algorithm presented above and Feynman-Kac formula to be made explicit as the 
proof relies on approaches pioneered in Del Moral (2004). Throughout this appendix we 
write r]K{-) = J r]{dx)K{a:, ■) for any compatible measure rj and Markov kernel K and 
r]{ip) = J r]{dx)ip{x) for any ry-integrable function (p. 

A Feynman-Kac formula describes the law of a Markov chain on{{Et,£t)}t>o (with initial 
distribution 770 and transitions Af/) evolving in the presence of a (time-varying) potential 
(described by Gt) such that the marginal law of the t^^ coordinate is: 

^ , . _ IE,x...xE,^^xAVo('^^o)llLlMi{x^-l,dx^)Gix,) 

for any measurable set A. 

It is convenient to define the operator <&t(r;)(da;t) = Gt{xt)r]Mt{dxt)/riMt{Gt) which 
allows us to write, recursively, fft = ^t{vt-i) and to define the intermediate distributions 
rjt = rjt-iMt such that 7?<(dxt) = G i{x t)ni{dx i) / rji{G t) . 

If X denotes the space upon which an SMC sampler with MCMC proposal Kt at time t 
and sequence of target distributions ttj operates, then we obtain -Kt as the final coordinate 
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marginal of the Feynman-Kac distribution at time t if we identify Et = X*, Mt{xt-i,dxt) = 
Sd;t-i{dxt,i:t-i)Kt{xt,t-i, dxt) and Gt{xt) = TTt{xt,t-i)/7Tt-i{xt,t-i). 

To provide symmetry between the simulation system and the ideal system which it tar- 
gets, it is convenient to let XI denote the extended sample corresponding to XI at iteration t 
together with the full collection of values which its ancestors took during previous iterations 
(i.e., XI corresponds to the particle system obtained by sampling according to Mj above 
rather than Kt at each iteration) . It is then convenient to write the particle approximation 
at time t as 

^^'idx,)=Y,-pi^6^,{dx,). 
i=i 2^3=1 '^A^t) 

We refer the reader to Del Moral (2004) for further details of the connection between such 
particle systems and the Feynman-Kac formula. 

In order to proceed, we prove the following more general result to which Proposition 1 
is a direct corollary. 

Proposition 2. Under the regularity conditions given in (Del Moral, 2004, section 9.4, pp. 

300-306), a weighted sum of integrab obtained from successive generations of the particle 
approximation of a Feynman-Kac flow {rytltLo. ^ith the application of multinomial resampling 
at every iteration, obeys a central limit theorem in the following sense, for a collection of finite 
weights j3t S K and bounded measurable functions S,t'-Et^M. (where, in the historical process 
case described above it is required that S,t{xt) = it{xt,t))- 

T 

viy > PfAvy lit) - mtf.}} 

JV->-oo 



lim VNj2f^t{v^i^t)-Vtm^m,VT{io:T)) (31) 

/V— >-oo ^ — ' 
t=0 

where Vt, <t <T is defined by the following recursion: 

Voi^o) =/3o j Vo{xo){^o{xo) - Vo{£,o))'^dxo 

Vt[^0:t) =n-lK0:t-2,^t-l + "5 ,j x + PtVt \ ^ n • 

V Pt-1 Vt-iMt[Gt) J V Vt{Gt) J 

(32) 

The strategy of the proof is to decompose the error as that propagated forward from 
previous times and that due to sampling at the current time, just as in Del Moral (2004) . 
First note that the term fj^{^t) — Vti^t) can be rewritten as 

(6) - ??t(6) = V^i^t) - + - mte) (33) 

and the weighted sum. 



T;^(€o:t) = ^Y^mfi^i) - mm (34) 

can therefore be written as 

= Tf(6:t)+xf(6) (35) 

where 

(^0:t) = T,^i(eo:t-i) + VNl3t{Mv^_,m - mm (36) 

xf (6) = ^m^iCt) - ^tivf-i)m (37) 

Lemma 1 shows that error propagation leads to controlled normal errors; Lemma 2 

shows that the act of sampling during each iteration also produces a normally-distributed 
error and Lemma 3 shows that these two normal errors can be combined leading by induc- 
tion to Proposition 2. 
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Lemma 1. Under the conditions of Proposition 2, if T^-^{^o;t-i) — > ■A/^(0, Vt-i{^o:t-i)), then 

T^o-.t) ^ f^{0,mo:t)) (38) 

where 

V-V ff f + Md^t^vtim] .39^ 



Proof. We begin by re-expressing the difference of interest in a more convenient form: 

1 

-{(Ci - Vt-i)Mt{Gt[^t - fitm)} (40) 



1 



W-iMtiGt) 

where the final equality is a simple consequence of the fact that for any integrable test 
function ip: 

rit-iMt{Gtv>) = Vt{Gt)vt{^) rjt-iMt{Gt[^t - f?i(6)]) = r,t{Gt)%{^t - 



Substituting this representation into Equation (36), 



= Jt-ll 40:t-2,tt-l + "5 ^Af (r< \ (4i) 

V pt-1 ■n'^_^Mt{Gt) J 

The proof is completed by using the result (Del Moral, 2004, cf. Sec. 7.4.3), that if Gt is 
essentially bounded below then, 

1 V 1 



ntMGt) m-iMt{Gt) 
together with the induction hypothesis. □ 
Lemma 2. Under the conditions of Proposition 2, 

limxf(6) A^(0,yt(6)) (42) 

t— >-CSO 

where 

%{it) = {ilnm-Uit)f) (43) 

Proof. Consider first the particle system before reweighting with the potential function Gt'. 

^^^^ f,(X.'«)-C.M.«.) ^^^„ (44, 

where C/^^- = -^HiO^^P) - ^t-x^Ait)}- Define, recursively the cr-algebras "Hf = n^_^ V 
cr({Xj-''}j^j), "Ht-i = cr (uj^^o^t^i) ^iid the increasing (in j) sequence of a-algebras 
U^j = Ut-x V It is clear that 

E[C/*^-K,-i] = E[[/,^.|Ht_i] = (45) 

and so the sequence Uf^ j = I, . . . ,N comprises a collection of 'H^^ -martingale increments. 
Further it can be verified that these martingale increments are square integrable, 

E[(C/t^-fK,-i]=E[«)'l^*-i] 

= f{CiM*(^,^)-[Ci)(6)]^}<c4 
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where ct < oo exists by the boundedness of ^t. The conditional Linderberg condition is also 
clearly satisfied. That is, for any < u < 1 and e > 0, 

lNu\ 

Thus we have 

and we can invoke the martingale central limit theorem (Shiryaev, 1995, pp. 543), 

lim xf(6) A^(0,14(6)) (46) 

where the asymptotic variance, Vt(Ct), rnay be written as the limit of the sequence defined 
by 

= miiMti^^) - [nl,M,m^} (47) 

and as (again, see (Del Moral, 2004, Section 7.4)) 

Vt^'i^t) A /3f {r7t_iM,(C2) - [r?t_iM,(6)]'} 

the proof is completed using Slutzky's lemma and applying Chopin (2004, Lemma A2) which 
yields that: 

lim A-,^ AAA(0,y,(6)) 

iV— >oo 



with 



□ 



Lemma 3. Under conditions of Proposition 2, and the inductive assumption of Lemma 1, 
T^{^o:t) is asymptotically normal with variance stated as in Proposition 2. 

Proof. Consider the characteristic function, 

^(Tf(^0:t))(5)=E[exp(isT,^(Co:0)] 

= E[exp(islf (^0:*)) exp(isxf (6))] 

= E[exp(i5T,^(^o:*))E[exp(zsxr(6))l^f-i]] 

= E[{At - eM-s^Vt{^t)m)Bt] + exp(-s2yt(6)/2)E[Bt] 

where At = E[exp{isx^' {^t))\'Hf-i] and Bt = exp{isf^{^o:t))- The first term can easily be 
shown to be converge a.s. to zero as A'^ — >^ oo by the asymptotic normality of and the 
conditional independence of the particles at iteration t given 'H^_i. The second term is the 
product of two Gaussian characteristic functions and thus we have that T/^ also follows a 
Gaussian distribution (see the argument of Kiinsch (2005) shown in more detail in Lemma 
10 in Johansen et al. (2006), which uses essentially the same argument, for details). □ 

Using Lemma 1 to 3, the proof of Proposition 2 follows by mathematical induction and 
a trivial base case (the first iteration is simple importance sampling). 



29 



References 



Andrieu, C. and E. Moulines (2006, August). On the ergodicity properties of some adaptive 
MCMC algorithms. The Annab of Applied Probability 16(3), 1462-1505. 

Atchade, Y. F., G. O. Roberts, and J. S. Rosenthal (2010). Towards optimal scaling of 
Metropolis-coupled Markov chain Monte Carlo. Statistics and Computing 21 (4), 555-568. 

Bartolucci, R, L. Scaccia, and A. Mira (2006). Efficient Bayes factor estimation from the 
reversible jump output. Biometrika 93(1), 41-52. 

Bernardo, J. M. and A. R M. Smith (1994). Bayesian Theory. Wiley Series in Probability and 
Statistics. John Wiley & Sons, Inc. 

Calderhead, B. and M. Girolami (2009). Estimating Bayes factors via thermodynamic inte- 
gration and population MCMC. Computational Statistics & Data Analysis 53(12), 4028- 
4045. 

Cappe, O., S. J. Godsill, and E. Moulines (2007). An overview of existing methods and 
recent advances in sequential Monte Carlo. Proceedings of the IEEE 95(5), 899-924. 

Cappe, O., C. P. Robert, and T. Ryden (2003). Reversible jump, birth-and-death and more 
general continuous time Markov chain Monte Carlo samplers. Journal of Royal Statistical 
Society B 65(3), 679-700. 

Carlin, B. P. and S. Chib (1995). Bayesian model choice via Markov chain Monte Carlo 
methods. Journal of Royal Statistical Society B 57(3), 473^84. 

Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statis- 
tical Association 90(432), 1313-1321. 

Chib, S. and I. Jeliazkov (2001). Marginal Ukelihood from the Metropolis-Hastings output. 
Journal of the American Statistical Association 96(453), 270-281. 

Chopin, N. (2002). A sequential particle filter method for static models. Biometrika 89(3), 
539-552. 

Chopin, N. (2004) . Central Umit theorem for sequential Monte Carlo methods and its appU- 
cation to Bayesian inference. The Annals of Statistics 32(6), 2385-2411. 

Del Moral, P. (1996). Nonlinear filtering: interacting particle solution. Markov Processes and 
Related Fields 4(2), 555-580. 

Del Moral, P. (2004). Feynman-Kac Formulae: Genealogical and Interacting Particle Systems 
with Applications. New York: Springer-Verlag. 

Del Moral, P., A. Doucet, and A. Jasra (2006a). Sequential Monte Carlo methods for 
Bayesian computation. In Bayesian Statistics 8. Oxford University Press. 

Del Moral, P., A. Doucet, and A. Jasra (2006b). Sequential Monte Carlo samplers. Journal 
of Royal Statistical Society B 68(3), 411^36. 

Del Moral, P., A. Doucet, and A. Jasra (2012). On adaptive resampling strategies for sequen- 
tial Monte Carlo methods. Bernoulli 18(1), 252-278. 

Didelot, X., R. G. Everitt, A. M. Johansen, and D. J. Lawson (2011). LikeUhood-free estima- 
tion of model evidence. Bayesian Analysis 6(1), 49-76. 

Doucet, A. and A. M. Johansen (2011). A tutorial on particle filtering and smoothing: Fifteen 
years later. In The Oxford Handbook of Non-linear Filtering. Oxford University Press. 

Fan, Y., D. Leslie, and M. P. Wand (2008). Generalised linear mixed model analysis via 
sequential Monte Carlo sampling. Electronic Journal of Statistics 2, 916-938. 



30 



Fearnhead, P. and B. Taylor (2010). An adaptive sequential Monte Carlo sampler. Mathe- 
matics Preprint 1005.1193v2, ArXiv. 

Friel, N., M. Hum, and J. Wyse (2012, September). Improving power posterior estimation 

of statistical evidence. ArXiv 1209.3198, 1-24. 

Gelfand, A. E. and D. K. Dey (1994). Bayesian model choice: Asymptotics and exact calcu- 
lations. Journal of Royal Statistical Society B 56(3), 501-514. 

Gelman, A. and X.-L. Meng (1998). Simulating normalizing constants: From importance 
sampling to bridge sampling to path sampling. Statistical Science 13(2), 163-185. 

Geyer, C. (1991). Monte Carlo maximum likelihood. In Keramigas (Ed.), Proceedings of 
Computing Science and Statistics: The 23rd Symposium on the Interface, Fairfax, pp. 156- 
161. Interface Foundation. 

Godsill, S. J. (2001). On the relationship between Markov chain Monte Carlo for model 
uncerntainty. Journal of Computational and Graphical Statistics J0(2), 230-248. 

Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian 
model determination. Biometrika 82(4), 711-732. 

Grelaud, A., C. P. Robert, J.-M. Marin, F. Rodolphe, and J.-F. Taly (2009). ABC likelihood- 
free methods for model choice in Gibbs random fields. Bccyesian Analysis 4(2), 317-336. 

Grenander, U. and M. I. Miller (1994). Representations of knowledge in complex systems. 
Journal of Royal Statistical Society B 56(4), 549-603. 

Gunn, R. N., S. R. Gunn, F. E. Turkheimer, J. A. D. Aston, and V. J. Cunningham (2002). 
Positron emission tomography compartmental models: A basis pursuit strategy for kinetic 
modeling. Journal of Cerebral Blood Flow & Metabolism 22(12), 1425-1439. 

Jasra, A., A. Doucet, D. A. Stephens, and C. C. Holmes (2008). Interacting sequential Monte 
Carlo samplers for trans-dimensional simulation. Computational Statistics & Data Analy- 
sis 52(,4), 1765-1791. 

Jasra, A., D. A. Stephens, A. Doucet, and T. Tsagaris (2010, December). Inference for Levy- 
Driven Stochastic Volatility Models via Adaptive Sequential Monte Carlo. Scandinavian 
Journal of Statistics 38(1), 1-22. 

Jasra, A., D. A. Stephens, and C. C. Holmes (2007a). On population-based simulation for 
static inference. Statistics and Computing 27(3), 263-279. 

Jasra, A., D. A. Stephens, and C. C. Holmes (2007b). Population-based reversible jump 
Markov chain Monte Carlo. Biometrika 94(4), 787-807. 

Jiang, C.-R., J. A. D. Aston, and J.-L. Wang (2009) . Smoothing dynamic positron emission 
tomography time courses using functional principal components. Neurolmage 47(1), 184- 
193. 

Johansen, A. M., P. Del Moral, and A. Doucet (2006). Sequential Monte Carlo samplers for 
rare events. In Proceedings of the 6th International Workshop on Rare Event Simulation, 
pp. 256-267. 

Johansen, A. M., A. Doucet, and M. Davy (2008) . Particle methods for maximum likelihood 
estimation in latent variable models. Statistics and Computing 28(1), 47-57. 

Johansen, A. M., S. S. Singh, A. Doucet, and B.-N. Vo (2006). Convergence of the SMC 
implementation of the PHD filter. Methodology and Computing in Applied Probability 8(2), 
265-291. 

Kong, A., J. S. Liu, and W. H. Wong (1994). Sequential imputations and Bayesian missing 
data problems. Journal of the American Statistical Association 89(425), 278-288. 



31 



Kiinsch, H. R. (2005). Recursive Monte Carlo filters: Algorithms and theoretical analysis. 
Annals of Statistics 33(5), 1983-2021. 

Lee, A., C. Yau, M. B. Giles, A. Doucet, and C. C. Holmes (2010). On the utility of graphics 
cards to perform massively parallel simulation of advanced Monte Carlo methods. Journal 
of Computational and Graphical Statistics 29(4), 769-789. 

Liang, F. and W. H. Wong (2001, June). Real-parameter evolutionary Monte Carlo with 
applications to Bayesian mixture models. Journal of the American Statistical Associa- 
tion 96(454), 653-666. 

Marinari, E. and G. Parisi (1992). Simulated tempering: a new Monte Carlo scheme. Euro- 
physics Letters 79(6), 451-458. 

Neal, R. M. (1994). Discussion of 'Approximate Bayesian inference with the weighted like- 
lihood bootstrap" by Newton and Raftery. Journal of the Royal Statistical Society, Series 
B 56(1), 41^2. 

Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing 22(2), 125- 
139. 

Newton, M. A. and A. E. Raftery (1994). Approximate Bayesian inference with the weighted 
likelihood bootstrap. Journal of Royal Statistical Society B 56(1), 3-48. 

Peters, G., K. Hayes, and G. Hossack (2010). Ecological non-linear state space model selec- 
tion via adaptive particle Markov chain Monte Carlo (AdPMCMC). Mathematics Preprint 
1005.2238, ArXiv. 

Peters, G. W. (2005) . Topics in sequential Monte Carlo samplers. Master's thesis. University 
of Cambridge, Department of Engineering. 

Raftery A. E., M. A. Newton, J. M. Satagopan, and P. N. Krivitsky (2006, November). Es- 
timating the Integrated Likelihood via Posterior Simulation Using the Harmonic Mean 
Identity. In Bayesian Statistics 8, pp. 1-45. Oxford University Press. 

Richardson, S. and P. J. Green (1997). On Bayesian analysis of mixtures with an unknown 
number of components. Journal of Royal Statistical Society B 59(4), 731-792. 

Robert, C. P. (2007). The Bayesian Choice: From Decision-theoretic Foundations to Computa- 
tional Implementation (2nded.). New York: Springer. 

Robert, C. P., J.-M. Marin, and N. S. Pillai (2011). Why approximate Bayesian computa- 
tional (ABC) methods cannot handle model choice problems. Proceedings of the National 
Academy of Sciences (USA) 108(.37X 15112-15117. 

Roberts, G. O. and J. S. Rosenthal (2001). Optimal scaling for various Metropolis-Hastings 
algorithms. Statistical Science 26(4), 351-367. 

Roberts, G. O. and R. L. Tweedie (1996). Exponential convergence of Langevin distributions 
and their discrete approximations. Bernoulli 2(4), 341-363. 

Rousset, M. and G. Stoltz (2006). Equilibrium sampling from nonequilibrium dynamics. 
Journal of Statistical Physics 123(,6), 1251-1272. 

Schafer, C. and N. Chopin (2013). Sequential Monte Carlo on large binary sampling spaces. 
Statistics and Computing 23 (,2), 163-184. In press. 

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 
461-464. 

Shiryaev, A. N. (1995). ProbabUily. Graduate Texts in Mathematics. New York: Springer- 
Verlag. 



32 



Stephens, M. (2000). Bayesian analysis of mixture models with an unkown number of 
components - An alternative to reversible jump methods. The Annab of Statistics 28(1), 
40-74. 

Vyshemirsky, V. and M. A. Girolami (2008). Bayesian ranking of biochemical system models. 
Bioinformatics 24{6), 833-839. 

Whiteley N. (2013). Sequential monte carlo samplers: error bounds and insensitivity to 
initial conditions. Stochastic Analysis and Applications 30(5), 774-798. In press. 

Zhou, Y., J. A. D. Aston, and A. M. Johansen (2013). Bayesian model comparison for com- 
partmental models with applications in positron emission tomography. Journal of Applied 
Statistics. In press. 

Zhou, Y., A. M. Johansen, and J. A. D. Aston (2012). Bayesian model selection via path- 
sampling sequential Monte Carlo. In Proceedings of IEEE Statistical Signal Procesing Work- 
shop, Ann Arbor, Michigan, USA, pp. 245-248. 



33 



